* [PATCH 1/2 v2] Btrfs: fix regression when running delayed references
@ 2015-10-25 10:04 fdmanana
2015-10-25 10:04 ` [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups fdmanana
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: fdmanana @ 2015-10-25 10:04 UTC (permalink / raw)
To: linux-btrfs; +Cc: stephane_btrfs, Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:
[35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
[35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
[35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
[35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
[35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
[35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
[35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
[35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
[35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
[35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
[35908.065201] Stack:
[35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
[35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
[35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
[35908.065201] Call Trace:
[35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
[35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
[35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
[35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
[35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
[35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
[35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
[35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
[35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
[35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
[35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP <ffff88010c4cbb08>
[35908.310885] ---[ end trace fe4299baf0666457 ]---
This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:
1) Transaction N starts, fs_info->tree_mod_seq has value 0;
2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
bytenr A is created, with a value of 1 and a seq value of 0;
3) fs_info->tree_mod_seq is incremented to 1;
4) Extent buffer A is deleted through btrfs_del_items(), which calls
btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
later returns the metadata extent associated to extent buffer A to
the free space cache (the range is not pinned), because the extent
buffer was created in the current transaction (N) and writeback never
happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
in the extent buffer).
This creates the delayed reference Ref2 for bytenr A, with a value
of -1 and a seq value of 1;
5) Delayed reference Ref2 is not merged with Ref1 when we create it,
because they have different sequence numbers (decided at
add_delayed_ref_tail_merge());
6) fs_info->tree_mod_seq is incremented to 2;
7) Some task attempts to allocate a new extent buffer (done at
extent-tree.c:find_free_extent()), but due to heavy fragmentation
and running low on metadata space the clustered allocation fails
and we fall back to unclustered allocation, which finds the
extent at offset A, so a new extent buffer at offset A is allocated.
This creates delayed reference Ref3 for bytenr A, with a value of -1
and a seq value of 2;
8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
all have different seq values;
9) We start running the delayed references (__btrfs_run_delayed_refs());
10) The delayed Ref1 is the first one being applied, which ends up
creating an inline extent backref in the extent tree;
10) Next the delayed reference Ref3 is selected for execution, and not
Ref2, because select_delayed_ref() always gives a preference for
positive references (that have an action of BTRFS_ADD_DELAYED_REF);
11) When running Ref3 we encounter alreay the inline extent backref
in the extent tree at insert_inline_extent_backref(), which makes
us hit the following BUG_ON:
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
This is always true because owner corresponds to the level of the
extent buffer/btree node in the btree.
For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.
We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:
c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
This issue became more exposed after the following change that was added
to 4.2 as well:
cffc3374e567 ("Btrfs: fix order by which delayed references are run")
Which in turn fixed another regression by the two commits previously
mentioned.
So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).
This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: No code changes. Added Stéphane's Tested-by tag and made patch
part of a 2 fixes series, to reflect the other remaining fix
depends on this one.
fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/extent-tree.c | 14 ++++++
2 files changed, 127 insertions(+)
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ac3e81d..4832943 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
trans->delayed_ref_updates--;
}
+static bool merge_ref(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head,
+ struct btrfs_delayed_ref_node *ref,
+ u64 seq)
+{
+ struct btrfs_delayed_ref_node *next;
+ bool done = false;
+
+ next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+ list);
+ while (!done && &next->list != &head->ref_list) {
+ int mod;
+ struct btrfs_delayed_ref_node *next2;
+
+ next2 = list_next_entry(next, list);
+
+ if (next == ref)
+ goto next;
+
+ if (seq && next->seq >= seq)
+ goto next;
+
+ if (next->type != ref->type || next->no_quota != ref->no_quota)
+ goto next;
+
+ if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+ ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
+ comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
+ btrfs_delayed_node_to_tree_ref(next),
+ ref->type))
+ goto next;
+ if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
+ ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
+ comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
+ btrfs_delayed_node_to_data_ref(next)))
+ goto next;
+
+ if (ref->action == next->action) {
+ mod = next->ref_mod;
+ } else {
+ if (ref->ref_mod < next->ref_mod) {
+ swap(ref, next);
+ done = true;
+ }
+ mod = -next->ref_mod;
+ }
+
+ drop_delayed_ref(trans, delayed_refs, head, next);
+ ref->ref_mod += mod;
+ if (ref->ref_mod == 0) {
+ drop_delayed_ref(trans, delayed_refs, head, ref);
+ done = true;
+ } else {
+ /*
+ * Can't have multiples of the same ref on a tree block.
+ */
+ WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+ ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
+ }
+next:
+ next = next2;
+ }
+
+ return done;
+}
+
+void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
+{
+ struct btrfs_delayed_ref_node *ref;
+ u64 seq = 0;
+
+ assert_spin_locked(&head->lock);
+
+ if (list_empty(&head->ref_list))
+ return;
+
+ /* We don't have too many refs to merge for data. */
+ if (head->is_data)
+ return;
+
+ spin_lock(&fs_info->tree_mod_seq_lock);
+ if (!list_empty(&fs_info->tree_mod_seq_list)) {
+ struct seq_list *elem;
+
+ elem = list_first_entry(&fs_info->tree_mod_seq_list,
+ struct seq_list, list);
+ seq = elem->seq;
+ }
+ spin_unlock(&fs_info->tree_mod_seq_lock);
+
+ ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+ list);
+ while (&ref->list != &head->ref_list) {
+ if (seq && ref->seq >= seq)
+ goto next;
+
+ if (merge_ref(trans, delayed_refs, head, ref, seq)) {
+ if (list_empty(&head->ref_list))
+ break;
+ ref = list_first_entry(&head->ref_list,
+ struct btrfs_delayed_ref_node,
+ list);
+ continue;
+ }
+next:
+ ref = list_next_entry(ref, list);
+ }
+}
+
int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
u64 seq)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2df4bc7..dd5756c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
}
}
+ /*
+ * We need to try and merge add/drops of the same ref since we
+ * can run into issues with relocate dropping the implicit ref
+ * and then it being added back again before the drop can
+ * finish. If we merged anything we need to re-loop so we can
+ * get a good ref.
+ * Or we can get node references of the same type that weren't
+ * merged when created due to bumps in the tree mod seq, and
+ * we need to merge them to prevent adding an inline extent
+ * backref before dropping it (triggering a BUG_ON at
+ * insert_inline_extent_backref()).
+ */
spin_lock(&locked_ref->lock);
+ btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
+ locked_ref);
/*
* locked_ref is the head node, so we have to go one
--
2.1.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups
2015-10-25 10:04 [PATCH 1/2 v2] Btrfs: fix regression when running delayed references fdmanana
@ 2015-10-25 10:04 ` fdmanana
2015-10-25 18:51 ` [PATCH 2/2 v3] " fdmanana
2015-10-25 18:51 ` [PATCH 1/2 v3] Btrfs: fix regression when running delayed references fdmanana
2015-10-26 9:22 ` [PATCH 1/2 v2] " Liu Bo
2 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2015-10-25 10:04 UTC (permalink / raw)
To: linux-btrfs; +Cc: stephane_btrfs, Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: There was no v1. But since the first patch in the series became
a v2, made this one a v2 from the start.
fs/btrfs/ctree.h | 4 ++--
fs/btrfs/delayed-ref.c | 28 +++++++----------------
fs/btrfs/delayed-ref.h | 7 ++----
fs/btrfs/extent-tree.c | 45 +++++++++++++-----------------------
fs/btrfs/file.c | 10 ++++----
fs/btrfs/inode.c | 4 ++--
fs/btrfs/ioctl.c | 62 +-------------------------------------------------
fs/btrfs/relocation.c | 16 ++++++-------
fs/btrfs/tree-log.c | 2 +-
9 files changed, 44 insertions(+), 134 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 49bc792..43eb981 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3398,7 +3398,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
int btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, int no_quota);
+ u64 owner, u64 offset);
int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len,
int delalloc);
@@ -3411,7 +3411,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
- u64 root_objectid, u64 owner, u64 offset, int no_quota);
+ u64 root_objectid, u64 owner, u64 offset);
int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 4832943..7832031 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -220,7 +220,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
if (seq && next->seq >= seq)
goto next;
- if (next->type != ref->type || next->no_quota != ref->no_quota)
+ if (next->type != ref->type)
goto next;
if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
@@ -405,8 +405,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
list);
/* No need to compare bytenr nor is_head */
- if (exist->type != ref->type || exist->no_quota != ref->no_quota ||
- exist->seq != ref->seq)
+ if (exist->type != ref->type || exist->seq != ref->seq)
goto add_tail;
if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
@@ -637,7 +636,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_head *head_ref,
struct btrfs_delayed_ref_node *ref, u64 bytenr,
u64 num_bytes, u64 parent, u64 ref_root, int level,
- int action, int no_quota)
+ int action)
{
struct btrfs_delayed_tree_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
@@ -659,7 +658,6 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->action = action;
ref->is_head = 0;
ref->in_tree = 1;
- ref->no_quota = no_quota;
ref->seq = seq;
full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -692,7 +690,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_head *head_ref,
struct btrfs_delayed_ref_node *ref, u64 bytenr,
u64 num_bytes, u64 parent, u64 ref_root, u64 owner,
- u64 offset, int action, int no_quota)
+ u64 offset, int action)
{
struct btrfs_delayed_data_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
@@ -715,7 +713,6 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
ref->action = action;
ref->is_head = 0;
ref->in_tree = 1;
- ref->no_quota = no_quota;
ref->seq = seq;
full_ref = btrfs_delayed_node_to_data_ref(ref);
@@ -746,17 +743,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, u64 parent,
u64 ref_root, int level, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota)
+ struct btrfs_delayed_extent_op *extent_op)
{
struct btrfs_delayed_tree_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
- if (!is_fstree(ref_root) || !fs_info->quota_enabled)
- no_quota = 0;
-
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
@@ -785,8 +778,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
bytenr, num_bytes, action, 0);
add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
- num_bytes, parent, ref_root, level, action,
- no_quota);
+ num_bytes, parent, ref_root, level, action);
spin_unlock(&delayed_refs->lock);
return 0;
@@ -807,17 +799,13 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes,
u64 parent, u64 ref_root,
u64 owner, u64 offset, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota)
+ struct btrfs_delayed_extent_op *extent_op)
{
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
- if (!is_fstree(ref_root) || !fs_info->quota_enabled)
- no_quota = 0;
-
BUG_ON(extent_op && !extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
@@ -853,7 +841,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
num_bytes, parent, ref_root, owner, offset,
- action, no_quota);
+ action);
spin_unlock(&delayed_refs->lock);
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 13fb5e6..930887a 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -68,7 +68,6 @@ struct btrfs_delayed_ref_node {
unsigned int action:8;
unsigned int type:8;
- unsigned int no_quota:1;
/* is this node still in the rbtree? */
unsigned int is_head:1;
unsigned int in_tree:1;
@@ -233,15 +232,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, u64 parent,
u64 ref_root, int level, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota);
+ struct btrfs_delayed_extent_op *extent_op);
int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
u64 parent, u64 ref_root,
u64 owner, u64 offset, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota);
+ struct btrfs_delayed_extent_op *extent_op);
int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dd5756c..fd8cd5a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -95,8 +95,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 parent, u64 root_objectid,
u64 flags, struct btrfs_disk_key *key,
- int level, struct btrfs_key *ins,
- int no_quota);
+ int level, struct btrfs_key *ins);
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *extent_root, u64 flags,
int force);
@@ -2009,8 +2008,7 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
- u64 root_objectid, u64 owner, u64 offset,
- int no_quota)
+ u64 root_objectid, u64 owner, u64 offset)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -2022,12 +2020,12 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, (int)owner,
- BTRFS_ADD_DELAYED_REF, NULL, no_quota);
+ BTRFS_ADD_DELAYED_REF, NULL);
} else {
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, owner, offset,
- BTRFS_ADD_DELAYED_REF, NULL, no_quota);
+ BTRFS_ADD_DELAYED_REF, NULL);
}
return ret;
}
@@ -2048,15 +2046,11 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
u64 num_bytes = node->num_bytes;
u64 refs;
int ret;
- int no_quota = node->no_quota;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- if (!is_fstree(root_objectid) || !root->fs_info->quota_enabled)
- no_quota = 1;
-
path->reada = 1;
path->leave_spinning = 1;
/* this will setup the path even if it fails to insert the back ref */
@@ -2291,8 +2285,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
parent, ref_root,
extent_op->flags_to_set,
&extent_op->key,
- ref->level, &ins,
- node->no_quota);
+ ref->level, &ins);
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
ret = __btrfs_inc_extent_ref(trans, root, node,
parent, ref_root,
@@ -3123,7 +3116,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
int level;
int ret = 0;
int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
- u64, u64, u64, u64, u64, u64, int);
+ u64, u64, u64, u64, u64, u64);
if (btrfs_test_is_dummy_root(root))
@@ -3164,15 +3157,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
key.offset -= btrfs_file_extent_offset(buf, fi);
ret = process_func(trans, root, bytenr, num_bytes,
parent, ref_root, key.objectid,
- key.offset, 1);
+ key.offset);
if (ret)
goto fail;
} else {
bytenr = btrfs_node_blockptr(buf, i);
num_bytes = root->nodesize;
ret = process_func(trans, root, bytenr, num_bytes,
- parent, ref_root, level - 1, 0,
- 1);
+ parent, ref_root, level - 1, 0);
if (ret)
goto fail;
}
@@ -6239,7 +6231,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
int extent_slot = 0;
int found_extent = 0;
int num_to_del = 1;
- int no_quota = node->no_quota;
u32 item_size;
u64 refs;
u64 bytenr = node->bytenr;
@@ -6248,9 +6239,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
bool skinny_metadata = btrfs_fs_incompat(root->fs_info,
SKINNY_METADATA);
- if (!info->quota_enabled || !is_fstree(root_objectid))
- no_quota = 1;
-
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -6576,7 +6564,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
buf->start, buf->len,
parent, root->root_key.objectid,
btrfs_header_level(buf),
- BTRFS_DROP_DELAYED_REF, NULL, 0);
+ BTRFS_DROP_DELAYED_REF, NULL);
BUG_ON(ret); /* -ENOMEM */
}
@@ -6624,7 +6612,7 @@ out:
/* Can return -ENOMEM */
int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, int no_quota)
+ u64 owner, u64 offset)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -6647,13 +6635,13 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, (int)owner,
- BTRFS_DROP_DELAYED_REF, NULL, no_quota);
+ BTRFS_DROP_DELAYED_REF, NULL);
} else {
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, owner,
offset, BTRFS_DROP_DELAYED_REF,
- NULL, no_quota);
+ NULL);
}
return ret;
}
@@ -7435,8 +7423,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 parent, u64 root_objectid,
u64 flags, struct btrfs_disk_key *key,
- int level, struct btrfs_key *ins,
- int no_quota)
+ int level, struct btrfs_key *ins)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -7526,7 +7513,7 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_data_ref(root->fs_info, trans, ins->objectid,
ins->offset, 0,
root_objectid, owner, offset,
- BTRFS_ADD_DELAYED_EXTENT, NULL, 0);
+ BTRFS_ADD_DELAYED_EXTENT, NULL);
return ret;
}
@@ -7740,7 +7727,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
ins.objectid, ins.offset,
parent, root_objectid, level,
BTRFS_ADD_DELAYED_EXTENT,
- extent_op, 0);
+ extent_op);
if (ret)
goto out_free_delayed;
}
@@ -8289,7 +8276,7 @@ skip:
}
}
ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
- root->root_key.objectid, level - 1, 0, 0);
+ root->root_key.objectid, level - 1, 0);
BUG_ON(ret); /* -ENOMEM */
}
btrfs_tree_unlock(next);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7031e96..243f4bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -847,7 +847,7 @@ next_slot:
disk_bytenr, num_bytes, 0,
root->root_key.objectid,
new_key.objectid,
- start - extent_offset, 1);
+ start - extent_offset);
BUG_ON(ret); /* -ENOMEM */
}
key.offset = start;
@@ -925,7 +925,7 @@ delete_extent_item:
disk_bytenr, num_bytes, 0,
root->root_key.objectid,
key.objectid, key.offset -
- extent_offset, 0);
+ extent_offset);
BUG_ON(ret); /* -ENOMEM */
inode_sub_bytes(inode,
extent_end - key.offset);
@@ -1204,7 +1204,7 @@ again:
ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
root->root_key.objectid,
- ino, orig_offset, 1);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
if (split == start) {
@@ -1231,7 +1231,7 @@ again:
del_nr++;
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
0, root->root_key.objectid,
- ino, orig_offset, 0);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
}
other_start = 0;
@@ -1248,7 +1248,7 @@ again:
del_nr++;
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
0, root->root_key.objectid,
- ino, orig_offset, 0);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
}
if (del_nr == 0) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ce55f6..87a01ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2576,7 +2576,7 @@ again:
ret = btrfs_inc_extent_ref(trans, root, new->bytenr,
new->disk_len, 0,
backref->root_id, backref->inum,
- new->file_pos, 0); /* start - extent_offset */
+ new->file_pos); /* start - extent_offset */
if (ret) {
btrfs_abort_transaction(trans, root, ret);
goto out_free_path;
@@ -4514,7 +4514,7 @@ delete:
ret = btrfs_free_extent(trans, root, extent_start,
extent_num_bytes, 0,
btrfs_header_owner(leaf),
- ino, extent_offset, 0);
+ ino, extent_offset);
BUG_ON(ret);
if (btrfs_should_throttle_delayed_refs(trans, root))
btrfs_async_run_delayed_refs(root,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 685df7e..9021f0c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3202,41 +3202,6 @@ out:
return ret;
}
-/* Helper to check and see if this root currently has a ref on the given disk
- * bytenr. If it does then we need to update the quota for this root. This
- * doesn't do anything if quotas aren't enabled.
- */
-static int check_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
- u64 disko)
-{
- struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
- struct ulist *roots;
- struct ulist_iterator uiter;
- struct ulist_node *root_node = NULL;
- int ret;
-
- if (!root->fs_info->quota_enabled)
- return 1;
-
- btrfs_get_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
- ret = btrfs_find_all_roots(trans, root->fs_info, disko,
- tree_mod_seq_elem.seq, &roots);
- if (ret < 0)
- goto out;
- ret = 0;
- ULIST_ITER_INIT(&uiter);
- while ((root_node = ulist_next(roots, &uiter))) {
- if (root_node->val == root->objectid) {
- ret = 1;
- break;
- }
- }
- ulist_free(roots);
-out:
- btrfs_put_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
- return ret;
-}
-
static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
struct inode *inode,
u64 endoff,
@@ -3495,9 +3460,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
u32 nritems;
int slot;
int ret;
- int no_quota;
const u64 len = olen_aligned;
- u64 last_disko = 0;
u64 last_dest_end = destoff;
ret = -ENOMEM;
@@ -3543,7 +3506,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
nritems = btrfs_header_nritems(path->nodes[0]);
process_slot:
- no_quota = 1;
if (path->slots[0] >= nritems) {
ret = btrfs_next_leaf(BTRFS_I(src)->root, path);
if (ret < 0)
@@ -3695,35 +3657,13 @@ process_slot:
btrfs_set_file_extent_num_bytes(leaf, extent,
datal);
- /*
- * We need to look up the roots that point at
- * this bytenr and see if the new root does. If
- * it does not we need to make sure we update
- * quotas appropriately.
- */
- if (disko && root != BTRFS_I(src)->root &&
- disko != last_disko) {
- no_quota = check_ref(trans, root,
- disko);
- if (no_quota < 0) {
- btrfs_abort_transaction(trans,
- root,
- ret);
- btrfs_end_transaction(trans,
- root);
- ret = no_quota;
- goto out;
- }
- }
-
if (disko) {
inode_add_bytes(inode, datal);
ret = btrfs_inc_extent_ref(trans, root,
disko, diskl, 0,
root->root_key.objectid,
btrfs_ino(inode),
- new_key.offset - datao,
- no_quota);
+ new_key.offset - datao);
if (ret) {
btrfs_abort_transaction(trans,
root,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 58ede0a..2b68441 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1716,7 +1716,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
ret = btrfs_inc_extent_ref(trans, root, new_bytenr,
num_bytes, parent,
btrfs_header_owner(leaf),
- key.objectid, key.offset, 1);
+ key.objectid, key.offset);
if (ret) {
btrfs_abort_transaction(trans, root, ret);
break;
@@ -1724,7 +1724,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
parent, btrfs_header_owner(leaf),
- key.objectid, key.offset, 1);
+ key.objectid, key.offset);
if (ret) {
btrfs_abort_transaction(trans, root, ret);
break;
@@ -1900,23 +1900,21 @@ again:
ret = btrfs_inc_extent_ref(trans, src, old_bytenr, blocksize,
path->nodes[level]->start,
- src->root_key.objectid, level - 1, 0,
- 1);
+ src->root_key.objectid, level - 1, 0);
BUG_ON(ret);
ret = btrfs_inc_extent_ref(trans, dest, new_bytenr, blocksize,
0, dest->root_key.objectid, level - 1,
- 0, 1);
+ 0);
BUG_ON(ret);
ret = btrfs_free_extent(trans, src, new_bytenr, blocksize,
path->nodes[level]->start,
- src->root_key.objectid, level - 1, 0,
- 1);
+ src->root_key.objectid, level - 1, 0);
BUG_ON(ret);
ret = btrfs_free_extent(trans, dest, old_bytenr, blocksize,
0, dest->root_key.objectid, level - 1,
- 0, 1);
+ 0);
BUG_ON(ret);
btrfs_unlock_up_safe(path, 0);
@@ -2745,7 +2743,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
node->eb->start, blocksize,
upper->eb->start,
btrfs_header_owner(upper->eb),
- node->level, 0, 1);
+ node->level, 0);
BUG_ON(ret);
ret = btrfs_drop_subtree(trans, root, eb, upper->eb);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1fffe88..323e12c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -693,7 +693,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
ret = btrfs_inc_extent_ref(trans, root,
ins.objectid, ins.offset,
0, root->root_key.objectid,
- key->objectid, offset, 0);
+ key->objectid, offset);
if (ret)
goto out;
} else {
--
2.1.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
2015-10-25 10:04 [PATCH 1/2 v2] Btrfs: fix regression when running delayed references fdmanana
2015-10-25 10:04 ` [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups fdmanana
@ 2015-10-25 18:51 ` fdmanana
2015-12-13 10:51 ` Alex Lyakas
2015-10-26 9:22 ` [PATCH 1/2 v2] " Liu Bo
2 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2015-10-25 18:51 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:
[35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
[35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
[35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
[35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
[35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
[35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
[35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
[35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
[35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
[35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
[35908.065201] Stack:
[35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
[35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
[35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
[35908.065201] Call Trace:
[35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
[35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
[35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
[35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
[35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
[35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
[35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
[35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
[35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
[35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
[35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP <ffff88010c4cbb08>
[35908.310885] ---[ end trace fe4299baf0666457 ]---
This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:
1) Transaction N starts, fs_info->tree_mod_seq has value 0;
2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
bytenr A is created, with a value of 1 and a seq value of 0;
3) fs_info->tree_mod_seq is incremented to 1;
4) Extent buffer A is deleted through btrfs_del_items(), which calls
btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
later returns the metadata extent associated to extent buffer A to
the free space cache (the range is not pinned), because the extent
buffer was created in the current transaction (N) and writeback never
happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
in the extent buffer).
This creates the delayed reference Ref2 for bytenr A, with a value
of -1 and a seq value of 1;
5) Delayed reference Ref2 is not merged with Ref1 when we create it,
because they have different sequence numbers (decided at
add_delayed_ref_tail_merge());
6) fs_info->tree_mod_seq is incremented to 2;
7) Some task attempts to allocate a new extent buffer (done at
extent-tree.c:find_free_extent()), but due to heavy fragmentation
and running low on metadata space the clustered allocation fails
and we fall back to unclustered allocation, which finds the
extent at offset A, so a new extent buffer at offset A is allocated.
This creates delayed reference Ref3 for bytenr A, with a value of -1
and a seq value of 2;
8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
all have different seq values;
9) We start running the delayed references (__btrfs_run_delayed_refs());
10) The delayed Ref1 is the first one being applied, which ends up
creating an inline extent backref in the extent tree;
10) Next the delayed reference Ref3 is selected for execution, and not
Ref2, because select_delayed_ref() always gives a preference for
positive references (that have an action of BTRFS_ADD_DELAYED_REF);
11) When running Ref3 we encounter alreay the inline extent backref
in the extent tree at insert_inline_extent_backref(), which makes
us hit the following BUG_ON:
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
This is always true because owner corresponds to the level of the
extent buffer/btree node in the btree.
For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.
We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:
c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
This issue became more exposed after the following change that was added
to 4.2 as well:
cffc3374e567 ("Btrfs: fix order by which delayed references are run")
Which in turn fixed another regression by the two commits previously
mentioned.
So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).
This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: No code changes. Added Stéphane's Tested-by tag and made patch
part of a 2 fixes series, to reflect the other remaining fix
depends on this one.
V3: No changes. Updated only the 2nd patch in the series to v3 (change log
updated).
fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/extent-tree.c | 14 ++++++
2 files changed, 127 insertions(+)
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index bd9b63b..2f41580 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
trans->delayed_ref_updates--;
}
+static bool merge_ref(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head,
+ struct btrfs_delayed_ref_node *ref,
+ u64 seq)
+{
+ struct btrfs_delayed_ref_node *next;
+ bool done = false;
+
+ next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+ list);
+ while (!done && &next->list != &head->ref_list) {
+ int mod;
+ struct btrfs_delayed_ref_node *next2;
+
+ next2 = list_next_entry(next, list);
+
+ if (next == ref)
+ goto next;
+
+ if (seq && next->seq >= seq)
+ goto next;
+
+ if (next->type != ref->type || next->no_quota != ref->no_quota)
+ goto next;
+
+ if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+ ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
+ comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
+ btrfs_delayed_node_to_tree_ref(next),
+ ref->type))
+ goto next;
+ if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
+ ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
+ comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
+ btrfs_delayed_node_to_data_ref(next)))
+ goto next;
+
+ if (ref->action == next->action) {
+ mod = next->ref_mod;
+ } else {
+ if (ref->ref_mod < next->ref_mod) {
+ swap(ref, next);
+ done = true;
+ }
+ mod = -next->ref_mod;
+ }
+
+ drop_delayed_ref(trans, delayed_refs, head, next);
+ ref->ref_mod += mod;
+ if (ref->ref_mod == 0) {
+ drop_delayed_ref(trans, delayed_refs, head, ref);
+ done = true;
+ } else {
+ /*
+ * Can't have multiples of the same ref on a tree block.
+ */
+ WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+ ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
+ }
+next:
+ next = next2;
+ }
+
+ return done;
+}
+
+void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
+{
+ struct btrfs_delayed_ref_node *ref;
+ u64 seq = 0;
+
+ assert_spin_locked(&head->lock);
+
+ if (list_empty(&head->ref_list))
+ return;
+
+ /* We don't have too many refs to merge for data. */
+ if (head->is_data)
+ return;
+
+ spin_lock(&fs_info->tree_mod_seq_lock);
+ if (!list_empty(&fs_info->tree_mod_seq_list)) {
+ struct seq_list *elem;
+
+ elem = list_first_entry(&fs_info->tree_mod_seq_list,
+ struct seq_list, list);
+ seq = elem->seq;
+ }
+ spin_unlock(&fs_info->tree_mod_seq_lock);
+
+ ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+ list);
+ while (&ref->list != &head->ref_list) {
+ if (seq && ref->seq >= seq)
+ goto next;
+
+ if (merge_ref(trans, delayed_refs, head, ref, seq)) {
+ if (list_empty(&head->ref_list))
+ break;
+ ref = list_first_entry(&head->ref_list,
+ struct btrfs_delayed_ref_node,
+ list);
+ continue;
+ }
+next:
+ ref = list_next_entry(ref, list);
+ }
+}
+
int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
u64 seq)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92fdbc6..c0f30f5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2502,7 +2502,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
}
}
+ /*
+ * We need to try and merge add/drops of the same ref since we
+ * can run into issues with relocate dropping the implicit ref
+ * and then it being added back again before the drop can
+ * finish. If we merged anything we need to re-loop so we can
+ * get a good ref.
+ * Or we can get node references of the same type that weren't
+ * merged when created due to bumps in the tree mod seq, and
+ * we need to merge them to prevent adding an inline extent
+ * backref before dropping it (triggering a BUG_ON at
+ * insert_inline_extent_backref()).
+ */
spin_lock(&locked_ref->lock);
+ btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
+ locked_ref);
/*
* locked_ref is the head node, so we have to go one
--
2.1.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2 v3] Btrfs: fix regression running delayed references when using qgroups
2015-10-25 10:04 ` [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups fdmanana
@ 2015-10-25 18:51 ` fdmanana
2015-10-26 8:32 ` Qu Wenruo
0 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2015-10-25 18:51 UTC (permalink / raw)
To: linux-btrfs; +Cc: quwenruo, Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: There was no v1. But since the first patch in the series became
a v2, made this one a v2 from the start.
V3: Updated changelog to point out the clone ioctl deadlock when
qgroups are enabled, and 2 bugs of no_quota field usage.
fs/btrfs/ctree.h | 4 ++--
fs/btrfs/delayed-ref.c | 28 +++++++----------------
fs/btrfs/delayed-ref.h | 7 ++----
fs/btrfs/extent-tree.c | 45 +++++++++++++-----------------------
fs/btrfs/file.c | 10 ++++----
fs/btrfs/inode.c | 4 ++--
fs/btrfs/ioctl.c | 62 +-------------------------------------------------
fs/btrfs/relocation.c | 16 ++++++-------
fs/btrfs/tree-log.c | 2 +-
9 files changed, 44 insertions(+), 134 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bc3c711..3fa3c3b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3422,7 +3422,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
int btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, int no_quota);
+ u64 owner, u64 offset);
int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len,
int delalloc);
@@ -3435,7 +3435,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
- u64 root_objectid, u64 owner, u64 offset, int no_quota);
+ u64 root_objectid, u64 owner, u64 offset);
int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 2f41580..1c3588a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -220,7 +220,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
if (seq && next->seq >= seq)
goto next;
- if (next->type != ref->type || next->no_quota != ref->no_quota)
+ if (next->type != ref->type)
goto next;
if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
@@ -405,8 +405,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
list);
/* No need to compare bytenr nor is_head */
- if (exist->type != ref->type || exist->no_quota != ref->no_quota ||
- exist->seq != ref->seq)
+ if (exist->type != ref->type || exist->seq != ref->seq)
goto add_tail;
if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
@@ -639,7 +638,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_head *head_ref,
struct btrfs_delayed_ref_node *ref, u64 bytenr,
u64 num_bytes, u64 parent, u64 ref_root, int level,
- int action, int no_quota)
+ int action)
{
struct btrfs_delayed_tree_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
@@ -661,7 +660,6 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->action = action;
ref->is_head = 0;
ref->in_tree = 1;
- ref->no_quota = no_quota;
ref->seq = seq;
full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -694,7 +692,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_head *head_ref,
struct btrfs_delayed_ref_node *ref, u64 bytenr,
u64 num_bytes, u64 parent, u64 ref_root, u64 owner,
- u64 offset, int action, int no_quota)
+ u64 offset, int action)
{
struct btrfs_delayed_data_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
@@ -717,7 +715,6 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
ref->action = action;
ref->is_head = 0;
ref->in_tree = 1;
- ref->no_quota = no_quota;
ref->seq = seq;
full_ref = btrfs_delayed_node_to_data_ref(ref);
@@ -748,17 +745,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, u64 parent,
u64 ref_root, int level, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota)
+ struct btrfs_delayed_extent_op *extent_op)
{
struct btrfs_delayed_tree_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
- if (!is_fstree(ref_root) || !fs_info->quota_enabled)
- no_quota = 0;
-
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
@@ -787,8 +780,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
bytenr, num_bytes, action, 0);
add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
- num_bytes, parent, ref_root, level, action,
- no_quota);
+ num_bytes, parent, ref_root, level, action);
spin_unlock(&delayed_refs->lock);
return 0;
@@ -809,17 +801,13 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes,
u64 parent, u64 ref_root,
u64 owner, u64 offset, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota)
+ struct btrfs_delayed_extent_op *extent_op)
{
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
- if (!is_fstree(ref_root) || !fs_info->quota_enabled)
- no_quota = 0;
-
BUG_ON(extent_op && !extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
@@ -855,7 +843,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
num_bytes, parent, ref_root, owner, offset,
- action, no_quota);
+ action);
spin_unlock(&delayed_refs->lock);
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d4c41e2..f9cf234 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -68,7 +68,6 @@ struct btrfs_delayed_ref_node {
unsigned int action:8;
unsigned int type:8;
- unsigned int no_quota:1;
/* is this node still in the rbtree? */
unsigned int is_head:1;
unsigned int in_tree:1;
@@ -244,15 +243,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, u64 parent,
u64 ref_root, int level, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota);
+ struct btrfs_delayed_extent_op *extent_op);
int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
u64 parent, u64 ref_root,
u64 owner, u64 offset, int action,
- struct btrfs_delayed_extent_op *extent_op,
- int no_quota);
+ struct btrfs_delayed_extent_op *extent_op);
int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 ref_root, u64 bytenr, u64 num_bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c0f30f5..c1f8c7e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -95,8 +95,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 parent, u64 root_objectid,
u64 flags, struct btrfs_disk_key *key,
- int level, struct btrfs_key *ins,
- int no_quota);
+ int level, struct btrfs_key *ins);
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *extent_root, u64 flags,
int force);
@@ -2073,8 +2072,7 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
- u64 root_objectid, u64 owner, u64 offset,
- int no_quota)
+ u64 root_objectid, u64 owner, u64 offset)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -2086,12 +2084,12 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, (int)owner,
- BTRFS_ADD_DELAYED_REF, NULL, no_quota);
+ BTRFS_ADD_DELAYED_REF, NULL);
} else {
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, owner, offset,
- BTRFS_ADD_DELAYED_REF, NULL, no_quota);
+ BTRFS_ADD_DELAYED_REF, NULL);
}
return ret;
}
@@ -2112,15 +2110,11 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
u64 num_bytes = node->num_bytes;
u64 refs;
int ret;
- int no_quota = node->no_quota;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- if (!is_fstree(root_objectid) || !root->fs_info->quota_enabled)
- no_quota = 1;
-
path->reada = 1;
path->leave_spinning = 1;
/* this will setup the path even if it fails to insert the back ref */
@@ -2355,8 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
parent, ref_root,
extent_op->flags_to_set,
&extent_op->key,
- ref->level, &ins,
- node->no_quota);
+ ref->level, &ins);
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
ret = __btrfs_inc_extent_ref(trans, root, node,
parent, ref_root,
@@ -3192,7 +3185,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
int level;
int ret = 0;
int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
- u64, u64, u64, u64, u64, u64, int);
+ u64, u64, u64, u64, u64, u64);
if (btrfs_test_is_dummy_root(root))
@@ -3233,15 +3226,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
key.offset -= btrfs_file_extent_offset(buf, fi);
ret = process_func(trans, root, bytenr, num_bytes,
parent, ref_root, key.objectid,
- key.offset, 1);
+ key.offset);
if (ret)
goto fail;
} else {
bytenr = btrfs_node_blockptr(buf, i);
num_bytes = root->nodesize;
ret = process_func(trans, root, bytenr, num_bytes,
- parent, ref_root, level - 1, 0,
- 1);
+ parent, ref_root, level - 1, 0);
if (ret)
goto fail;
}
@@ -6431,7 +6423,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
int extent_slot = 0;
int found_extent = 0;
int num_to_del = 1;
- int no_quota = node->no_quota;
u32 item_size;
u64 refs;
u64 bytenr = node->bytenr;
@@ -6440,9 +6431,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
bool skinny_metadata = btrfs_fs_incompat(root->fs_info,
SKINNY_METADATA);
- if (!info->quota_enabled || !is_fstree(root_objectid))
- no_quota = 1;
-
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -6768,7 +6756,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
buf->start, buf->len,
parent, root->root_key.objectid,
btrfs_header_level(buf),
- BTRFS_DROP_DELAYED_REF, NULL, 0);
+ BTRFS_DROP_DELAYED_REF, NULL);
BUG_ON(ret); /* -ENOMEM */
}
@@ -6816,7 +6804,7 @@ out:
/* Can return -ENOMEM */
int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, int no_quota)
+ u64 owner, u64 offset)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -6839,13 +6827,13 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, (int)owner,
- BTRFS_DROP_DELAYED_REF, NULL, no_quota);
+ BTRFS_DROP_DELAYED_REF, NULL);
} else {
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes,
parent, root_objectid, owner,
offset, BTRFS_DROP_DELAYED_REF,
- NULL, no_quota);
+ NULL);
}
return ret;
}
@@ -7690,8 +7678,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 parent, u64 root_objectid,
u64 flags, struct btrfs_disk_key *key,
- int level, struct btrfs_key *ins,
- int no_quota)
+ int level, struct btrfs_key *ins)
{
int ret;
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -7781,7 +7768,7 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_data_ref(root->fs_info, trans, ins->objectid,
ins->offset, 0,
root_objectid, owner, offset,
- BTRFS_ADD_DELAYED_EXTENT, NULL, 0);
+ BTRFS_ADD_DELAYED_EXTENT, NULL);
return ret;
}
@@ -7995,7 +7982,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
ins.objectid, ins.offset,
parent, root_objectid, level,
BTRFS_ADD_DELAYED_EXTENT,
- extent_op, 0);
+ extent_op);
if (ret)
goto out_free_delayed;
}
@@ -8544,7 +8531,7 @@ skip:
}
}
ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
- root->root_key.objectid, level - 1, 0, 0);
+ root->root_key.objectid, level - 1, 0);
BUG_ON(ret); /* -ENOMEM */
}
btrfs_tree_unlock(next);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1243205..381be79 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -847,7 +847,7 @@ next_slot:
disk_bytenr, num_bytes, 0,
root->root_key.objectid,
new_key.objectid,
- start - extent_offset, 1);
+ start - extent_offset);
BUG_ON(ret); /* -ENOMEM */
}
key.offset = start;
@@ -925,7 +925,7 @@ delete_extent_item:
disk_bytenr, num_bytes, 0,
root->root_key.objectid,
key.objectid, key.offset -
- extent_offset, 0);
+ extent_offset);
BUG_ON(ret); /* -ENOMEM */
inode_sub_bytes(inode,
extent_end - key.offset);
@@ -1204,7 +1204,7 @@ again:
ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
root->root_key.objectid,
- ino, orig_offset, 1);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
if (split == start) {
@@ -1231,7 +1231,7 @@ again:
del_nr++;
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
0, root->root_key.objectid,
- ino, orig_offset, 0);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
}
other_start = 0;
@@ -1248,7 +1248,7 @@ again:
del_nr++;
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
0, root->root_key.objectid,
- ino, orig_offset, 0);
+ ino, orig_offset);
BUG_ON(ret); /* -ENOMEM */
}
if (del_nr == 0) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a018e47..6f030c2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2595,7 +2595,7 @@ again:
ret = btrfs_inc_extent_ref(trans, root, new->bytenr,
new->disk_len, 0,
backref->root_id, backref->inum,
- new->file_pos, 0); /* start - extent_offset */
+ new->file_pos); /* start - extent_offset */
if (ret) {
btrfs_abort_transaction(trans, root, ret);
goto out_free_path;
@@ -4541,7 +4541,7 @@ delete:
ret = btrfs_free_extent(trans, root, extent_start,
extent_num_bytes, 0,
btrfs_header_owner(leaf),
- ino, extent_offset, 0);
+ ino, extent_offset);
BUG_ON(ret);
if (btrfs_should_throttle_delayed_refs(trans, root))
btrfs_async_run_delayed_refs(root,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7ed033a..4df0f2b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3206,41 +3206,6 @@ out:
return ret;
}
-/* Helper to check and see if this root currently has a ref on the given disk
- * bytenr. If it does then we need to update the quota for this root. This
- * doesn't do anything if quotas aren't enabled.
- */
-static int check_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
- u64 disko)
-{
- struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
- struct ulist *roots;
- struct ulist_iterator uiter;
- struct ulist_node *root_node = NULL;
- int ret;
-
- if (!root->fs_info->quota_enabled)
- return 1;
-
- btrfs_get_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
- ret = btrfs_find_all_roots(trans, root->fs_info, disko,
- tree_mod_seq_elem.seq, &roots);
- if (ret < 0)
- goto out;
- ret = 0;
- ULIST_ITER_INIT(&uiter);
- while ((root_node = ulist_next(roots, &uiter))) {
- if (root_node->val == root->objectid) {
- ret = 1;
- break;
- }
- }
- ulist_free(roots);
-out:
- btrfs_put_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
- return ret;
-}
-
static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
struct inode *inode,
u64 endoff,
@@ -3499,9 +3464,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
u32 nritems;
int slot;
int ret;
- int no_quota;
const u64 len = olen_aligned;
- u64 last_disko = 0;
u64 last_dest_end = destoff;
ret = -ENOMEM;
@@ -3547,7 +3510,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
nritems = btrfs_header_nritems(path->nodes[0]);
process_slot:
- no_quota = 1;
if (path->slots[0] >= nritems) {
ret = btrfs_next_leaf(BTRFS_I(src)->root, path);
if (ret < 0)
@@ -3699,35 +3661,13 @@ process_slot:
btrfs_set_file_extent_num_bytes(leaf, extent,
datal);
- /*
- * We need to look up the roots that point at
- * this bytenr and see if the new root does. If
- * it does not we need to make sure we update
- * quotas appropriately.
- */
- if (disko && root != BTRFS_I(src)->root &&
- disko != last_disko) {
- no_quota = check_ref(trans, root,
- disko);
- if (no_quota < 0) {
- btrfs_abort_transaction(trans,
- root,
- ret);
- btrfs_end_transaction(trans,
- root);
- ret = no_quota;
- goto out;
- }
- }
-
if (disko) {
inode_add_bytes(inode, datal);
ret = btrfs_inc_extent_ref(trans, root,
disko, diskl, 0,
root->root_key.objectid,
btrfs_ino(inode),
- new_key.offset - datao,
- no_quota);
+ new_key.offset - datao);
if (ret) {
btrfs_abort_transaction(trans,
root,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index a7dc456..b4ca545 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1716,7 +1716,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
ret = btrfs_inc_extent_ref(trans, root, new_bytenr,
num_bytes, parent,
btrfs_header_owner(leaf),
- key.objectid, key.offset, 1);
+ key.objectid, key.offset);
if (ret) {
btrfs_abort_transaction(trans, root, ret);
break;
@@ -1724,7 +1724,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
parent, btrfs_header_owner(leaf),
- key.objectid, key.offset, 1);
+ key.objectid, key.offset);
if (ret) {
btrfs_abort_transaction(trans, root, ret);
break;
@@ -1900,23 +1900,21 @@ again:
ret = btrfs_inc_extent_ref(trans, src, old_bytenr, blocksize,
path->nodes[level]->start,
- src->root_key.objectid, level - 1, 0,
- 1);
+ src->root_key.objectid, level - 1, 0);
BUG_ON(ret);
ret = btrfs_inc_extent_ref(trans, dest, new_bytenr, blocksize,
0, dest->root_key.objectid, level - 1,
- 0, 1);
+ 0);
BUG_ON(ret);
ret = btrfs_free_extent(trans, src, new_bytenr, blocksize,
path->nodes[level]->start,
- src->root_key.objectid, level - 1, 0,
- 1);
+ src->root_key.objectid, level - 1, 0);
BUG_ON(ret);
ret = btrfs_free_extent(trans, dest, old_bytenr, blocksize,
0, dest->root_key.objectid, level - 1,
- 0, 1);
+ 0);
BUG_ON(ret);
btrfs_unlock_up_safe(path, 0);
@@ -2745,7 +2743,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
node->eb->start, blocksize,
upper->eb->start,
btrfs_header_owner(upper->eb),
- node->level, 0, 1);
+ node->level, 0);
BUG_ON(ret);
ret = btrfs_drop_subtree(trans, root, eb, upper->eb);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1fffe88..323e12c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -693,7 +693,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
ret = btrfs_inc_extent_ref(trans, root,
ins.objectid, ins.offset,
0, root->root_key.objectid,
- key->objectid, offset, 0);
+ key->objectid, offset);
if (ret)
goto out;
} else {
--
2.1.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2 v3] Btrfs: fix regression running delayed references when using qgroups
2015-10-25 18:51 ` [PATCH 2/2 v3] " fdmanana
@ 2015-10-26 8:32 ` Qu Wenruo
0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2015-10-26 8:32 UTC (permalink / raw)
To: fdmanana, linux-btrfs; +Cc: Filipe Manana
wrote on 2015/10/25 18:51 +0000:
> From: Filipe Manana <fdmanana@suse.com>
>
> In the kernel 4.2 merge window we had a big changes to the implementation
> of delayed references and qgroups which made the no_quota field of delayed
> references not used anymore. More specifically the no_quota field is not
> used anymore as of:
>
> commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
>
> Leaving the no_quota field actually prevents delayed references from
> getting merged, which in turn cause the following BUG_ON(), at
> fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
>
> static int run_delayed_tree_ref(...)
> {
> (...)
> BUG_ON(node->ref_mod != 1);
> (...)
> }
>
> This happens on a scenario like the following:
>
> 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
>
> 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
> It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
>
> 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
> It's not merged with the reference at the tail of the list of refs
> for bytenr X because the reference at the tail, Ref2 is incompatible
> due to Ref2->no_quota != Ref3->no_quota.
>
> 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
> It's not merged with the reference at the tail of the list of refs
> for bytenr X because the reference at the tail, Ref3 is incompatible
> due to Ref3->no_quota != Ref4->no_quota.
>
> 5) We run delayed references, trigger merging of delayed references,
> through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
>
> 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
> all other conditions are satisfied too. So Ref1 gets a ref_mod
> value of 2.
>
> 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
> all other conditions are satisfied too. So Ref2 gets a ref_mod
> value of 2.
>
> 8) Ref1 and Ref2 aren't merged, because they have different values
> for their no_quota field.
>
> 9) Delayed reference Ref1 is picked for running (select_delayed_ref()
> always prefers references with an action == BTRFS_ADD_DELAYED_REF).
> So run_delayed_tree_ref() is called for Ref1 which triggers the
> BUG_ON because Ref1->red_mod != 1 (equals 2).
>
> So fix this by removing the no_quota field, as it's not used anymore as
> of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
> qgroup mechanism.").
>
> The use of no_quota was also buggy in at least two places:
>
> 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
> no_quota to 0 instead of 1 when the following condition was true:
> is_fstree(ref_root) || !fs_info->quota_enabled
>
> 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
> reset a node's no_quota when the condition "!is_fstree(root_objectid)
> || !root->fs_info->quota_enabled" was true but we did it only in
> an unused local stack variable, that is, we never reset the no_quota
> value in the node itself.
>
> This fixes the remainder of problems several people have been having when
> running delayed references, mostly while a balance is running in parallel,
> on a 4.2+ kernel.
>
> Very special thanks to Stéphane Lesimple for helping debugging this issue
> and testing this fix on his multi terabyte filesystem (which took more
> than one day to balance alone, plus fsck, etc).
>
> Also, this fixes deadlock issue when using the clone ioctl with qgroups
> enabled, as reported by Elias Probst in the mailing list. The deadlock
> happens because after calling btrfs_insert_empty_item we have our path
> holding a write lock on a leaf of the fs/subvol tree and then before
> releasing the path we called check_ref() which did backref walking, when
> qgroups are enabled, and tried to read lock the same leaf. The trace for
> this case is the following:
>
> INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
> (...)
> Call Trace:
> [<ffffffff86999201>] schedule+0x74/0x83
> [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
> [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
> [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
> [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
> [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
> [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
> [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
> [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
> [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
> [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
> [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
> [<ffffffff863e852e>] check_ref+0x64/0xc4
> [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
> [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
> [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
> [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
> (...)
>
> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Reported-by: Elias Probst <mail@eliasprobst.eu>
> Reported-by: Peter Becker <floyd.net@gmail.com>
> Reported-by: Malte Schröder <malte@tnxip.de>
> Reported-by: Derek Dongray <derek@valedon.co.uk>
> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
> Cc: stable@vger.kernel.org # 4.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Thanks,
Qu
> ---
>
> V2: There was no v1. But since the first patch in the series became
> a v2, made this one a v2 from the start.
> V3: Updated changelog to point out the clone ioctl deadlock when
> qgroups are enabled, and 2 bugs of no_quota field usage.
>
> fs/btrfs/ctree.h | 4 ++--
> fs/btrfs/delayed-ref.c | 28 +++++++----------------
> fs/btrfs/delayed-ref.h | 7 ++----
> fs/btrfs/extent-tree.c | 45 +++++++++++++-----------------------
> fs/btrfs/file.c | 10 ++++----
> fs/btrfs/inode.c | 4 ++--
> fs/btrfs/ioctl.c | 62 +-------------------------------------------------
> fs/btrfs/relocation.c | 16 ++++++-------
> fs/btrfs/tree-log.c | 2 +-
> 9 files changed, 44 insertions(+), 134 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index bc3c711..3fa3c3b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3422,7 +3422,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
> int btrfs_free_extent(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
> - u64 owner, u64 offset, int no_quota);
> + u64 owner, u64 offset);
>
> int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len,
> int delalloc);
> @@ -3435,7 +3435,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
> int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> u64 bytenr, u64 num_bytes, u64 parent,
> - u64 root_objectid, u64 owner, u64 offset, int no_quota);
> + u64 root_objectid, u64 owner, u64 offset);
>
> int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
> struct btrfs_root *root);
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 2f41580..1c3588a 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -220,7 +220,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
> if (seq && next->seq >= seq)
> goto next;
>
> - if (next->type != ref->type || next->no_quota != ref->no_quota)
> + if (next->type != ref->type)
> goto next;
>
> if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> @@ -405,8 +405,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
> exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
> list);
> /* No need to compare bytenr nor is_head */
> - if (exist->type != ref->type || exist->no_quota != ref->no_quota ||
> - exist->seq != ref->seq)
> + if (exist->type != ref->type || exist->seq != ref->seq)
> goto add_tail;
>
> if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
> @@ -639,7 +638,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_head *head_ref,
> struct btrfs_delayed_ref_node *ref, u64 bytenr,
> u64 num_bytes, u64 parent, u64 ref_root, int level,
> - int action, int no_quota)
> + int action)
> {
> struct btrfs_delayed_tree_ref *full_ref;
> struct btrfs_delayed_ref_root *delayed_refs;
> @@ -661,7 +660,6 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> ref->action = action;
> ref->is_head = 0;
> ref->in_tree = 1;
> - ref->no_quota = no_quota;
> ref->seq = seq;
>
> full_ref = btrfs_delayed_node_to_tree_ref(ref);
> @@ -694,7 +692,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_head *head_ref,
> struct btrfs_delayed_ref_node *ref, u64 bytenr,
> u64 num_bytes, u64 parent, u64 ref_root, u64 owner,
> - u64 offset, int action, int no_quota)
> + u64 offset, int action)
> {
> struct btrfs_delayed_data_ref *full_ref;
> struct btrfs_delayed_ref_root *delayed_refs;
> @@ -717,7 +715,6 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
> ref->action = action;
> ref->is_head = 0;
> ref->in_tree = 1;
> - ref->no_quota = no_quota;
> ref->seq = seq;
>
> full_ref = btrfs_delayed_node_to_data_ref(ref);
> @@ -748,17 +745,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_trans_handle *trans,
> u64 bytenr, u64 num_bytes, u64 parent,
> u64 ref_root, int level, int action,
> - struct btrfs_delayed_extent_op *extent_op,
> - int no_quota)
> + struct btrfs_delayed_extent_op *extent_op)
> {
> struct btrfs_delayed_tree_ref *ref;
> struct btrfs_delayed_ref_head *head_ref;
> struct btrfs_delayed_ref_root *delayed_refs;
> struct btrfs_qgroup_extent_record *record = NULL;
>
> - if (!is_fstree(ref_root) || !fs_info->quota_enabled)
> - no_quota = 0;
> -
> BUG_ON(extent_op && extent_op->is_data);
> ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
> if (!ref)
> @@ -787,8 +780,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> bytenr, num_bytes, action, 0);
>
> add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
> - num_bytes, parent, ref_root, level, action,
> - no_quota);
> + num_bytes, parent, ref_root, level, action);
> spin_unlock(&delayed_refs->lock);
>
> return 0;
> @@ -809,17 +801,13 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
> u64 bytenr, u64 num_bytes,
> u64 parent, u64 ref_root,
> u64 owner, u64 offset, int action,
> - struct btrfs_delayed_extent_op *extent_op,
> - int no_quota)
> + struct btrfs_delayed_extent_op *extent_op)
> {
> struct btrfs_delayed_data_ref *ref;
> struct btrfs_delayed_ref_head *head_ref;
> struct btrfs_delayed_ref_root *delayed_refs;
> struct btrfs_qgroup_extent_record *record = NULL;
>
> - if (!is_fstree(ref_root) || !fs_info->quota_enabled)
> - no_quota = 0;
> -
> BUG_ON(extent_op && !extent_op->is_data);
> ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
> if (!ref)
> @@ -855,7 +843,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>
> add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
> num_bytes, parent, ref_root, owner, offset,
> - action, no_quota);
> + action);
> spin_unlock(&delayed_refs->lock);
>
> return 0;
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index d4c41e2..f9cf234 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -68,7 +68,6 @@ struct btrfs_delayed_ref_node {
>
> unsigned int action:8;
> unsigned int type:8;
> - unsigned int no_quota:1;
> /* is this node still in the rbtree? */
> unsigned int is_head:1;
> unsigned int in_tree:1;
> @@ -244,15 +243,13 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_trans_handle *trans,
> u64 bytenr, u64 num_bytes, u64 parent,
> u64 ref_root, int level, int action,
> - struct btrfs_delayed_extent_op *extent_op,
> - int no_quota);
> + struct btrfs_delayed_extent_op *extent_op);
> int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_trans_handle *trans,
> u64 bytenr, u64 num_bytes,
> u64 parent, u64 ref_root,
> u64 owner, u64 offset, int action,
> - struct btrfs_delayed_extent_op *extent_op,
> - int no_quota);
> + struct btrfs_delayed_extent_op *extent_op);
> int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
> struct btrfs_trans_handle *trans,
> u64 ref_root, u64 bytenr, u64 num_bytes);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c0f30f5..c1f8c7e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -95,8 +95,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> u64 parent, u64 root_objectid,
> u64 flags, struct btrfs_disk_key *key,
> - int level, struct btrfs_key *ins,
> - int no_quota);
> + int level, struct btrfs_key *ins);
> static int do_chunk_alloc(struct btrfs_trans_handle *trans,
> struct btrfs_root *extent_root, u64 flags,
> int force);
> @@ -2073,8 +2072,7 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
> int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> u64 bytenr, u64 num_bytes, u64 parent,
> - u64 root_objectid, u64 owner, u64 offset,
> - int no_quota)
> + u64 root_objectid, u64 owner, u64 offset)
> {
> int ret;
> struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -2086,12 +2084,12 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
> ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
> num_bytes,
> parent, root_objectid, (int)owner,
> - BTRFS_ADD_DELAYED_REF, NULL, no_quota);
> + BTRFS_ADD_DELAYED_REF, NULL);
> } else {
> ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
> num_bytes,
> parent, root_objectid, owner, offset,
> - BTRFS_ADD_DELAYED_REF, NULL, no_quota);
> + BTRFS_ADD_DELAYED_REF, NULL);
> }
> return ret;
> }
> @@ -2112,15 +2110,11 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
> u64 num_bytes = node->num_bytes;
> u64 refs;
> int ret;
> - int no_quota = node->no_quota;
>
> path = btrfs_alloc_path();
> if (!path)
> return -ENOMEM;
>
> - if (!is_fstree(root_objectid) || !root->fs_info->quota_enabled)
> - no_quota = 1;
> -
> path->reada = 1;
> path->leave_spinning = 1;
> /* this will setup the path even if it fails to insert the back ref */
> @@ -2355,8 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
> parent, ref_root,
> extent_op->flags_to_set,
> &extent_op->key,
> - ref->level, &ins,
> - node->no_quota);
> + ref->level, &ins);
> } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> ret = __btrfs_inc_extent_ref(trans, root, node,
> parent, ref_root,
> @@ -3192,7 +3185,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
> int level;
> int ret = 0;
> int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, int);
> + u64, u64, u64, u64, u64, u64);
>
>
> if (btrfs_test_is_dummy_root(root))
> @@ -3233,15 +3226,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
> key.offset -= btrfs_file_extent_offset(buf, fi);
> ret = process_func(trans, root, bytenr, num_bytes,
> parent, ref_root, key.objectid,
> - key.offset, 1);
> + key.offset);
> if (ret)
> goto fail;
> } else {
> bytenr = btrfs_node_blockptr(buf, i);
> num_bytes = root->nodesize;
> ret = process_func(trans, root, bytenr, num_bytes,
> - parent, ref_root, level - 1, 0,
> - 1);
> + parent, ref_root, level - 1, 0);
> if (ret)
> goto fail;
> }
> @@ -6431,7 +6423,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> int extent_slot = 0;
> int found_extent = 0;
> int num_to_del = 1;
> - int no_quota = node->no_quota;
> u32 item_size;
> u64 refs;
> u64 bytenr = node->bytenr;
> @@ -6440,9 +6431,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> bool skinny_metadata = btrfs_fs_incompat(root->fs_info,
> SKINNY_METADATA);
>
> - if (!info->quota_enabled || !is_fstree(root_objectid))
> - no_quota = 1;
> -
> path = btrfs_alloc_path();
> if (!path)
> return -ENOMEM;
> @@ -6768,7 +6756,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> buf->start, buf->len,
> parent, root->root_key.objectid,
> btrfs_header_level(buf),
> - BTRFS_DROP_DELAYED_REF, NULL, 0);
> + BTRFS_DROP_DELAYED_REF, NULL);
> BUG_ON(ret); /* -ENOMEM */
> }
>
> @@ -6816,7 +6804,7 @@ out:
> /* Can return -ENOMEM */
> int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
> - u64 owner, u64 offset, int no_quota)
> + u64 owner, u64 offset)
> {
> int ret;
> struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -6839,13 +6827,13 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
> num_bytes,
> parent, root_objectid, (int)owner,
> - BTRFS_DROP_DELAYED_REF, NULL, no_quota);
> + BTRFS_DROP_DELAYED_REF, NULL);
> } else {
> ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
> num_bytes,
> parent, root_objectid, owner,
> offset, BTRFS_DROP_DELAYED_REF,
> - NULL, no_quota);
> + NULL);
> }
> return ret;
> }
> @@ -7690,8 +7678,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> u64 parent, u64 root_objectid,
> u64 flags, struct btrfs_disk_key *key,
> - int level, struct btrfs_key *ins,
> - int no_quota)
> + int level, struct btrfs_key *ins)
> {
> int ret;
> struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -7781,7 +7768,7 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> ret = btrfs_add_delayed_data_ref(root->fs_info, trans, ins->objectid,
> ins->offset, 0,
> root_objectid, owner, offset,
> - BTRFS_ADD_DELAYED_EXTENT, NULL, 0);
> + BTRFS_ADD_DELAYED_EXTENT, NULL);
> return ret;
> }
>
> @@ -7995,7 +7982,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
> ins.objectid, ins.offset,
> parent, root_objectid, level,
> BTRFS_ADD_DELAYED_EXTENT,
> - extent_op, 0);
> + extent_op);
> if (ret)
> goto out_free_delayed;
> }
> @@ -8544,7 +8531,7 @@ skip:
> }
> }
> ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
> - root->root_key.objectid, level - 1, 0, 0);
> + root->root_key.objectid, level - 1, 0);
> BUG_ON(ret); /* -ENOMEM */
> }
> btrfs_tree_unlock(next);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 1243205..381be79 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -847,7 +847,7 @@ next_slot:
> disk_bytenr, num_bytes, 0,
> root->root_key.objectid,
> new_key.objectid,
> - start - extent_offset, 1);
> + start - extent_offset);
> BUG_ON(ret); /* -ENOMEM */
> }
> key.offset = start;
> @@ -925,7 +925,7 @@ delete_extent_item:
> disk_bytenr, num_bytes, 0,
> root->root_key.objectid,
> key.objectid, key.offset -
> - extent_offset, 0);
> + extent_offset);
> BUG_ON(ret); /* -ENOMEM */
> inode_sub_bytes(inode,
> extent_end - key.offset);
> @@ -1204,7 +1204,7 @@ again:
>
> ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
> root->root_key.objectid,
> - ino, orig_offset, 1);
> + ino, orig_offset);
> BUG_ON(ret); /* -ENOMEM */
>
> if (split == start) {
> @@ -1231,7 +1231,7 @@ again:
> del_nr++;
> ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
> 0, root->root_key.objectid,
> - ino, orig_offset, 0);
> + ino, orig_offset);
> BUG_ON(ret); /* -ENOMEM */
> }
> other_start = 0;
> @@ -1248,7 +1248,7 @@ again:
> del_nr++;
> ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
> 0, root->root_key.objectid,
> - ino, orig_offset, 0);
> + ino, orig_offset);
> BUG_ON(ret); /* -ENOMEM */
> }
> if (del_nr == 0) {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a018e47..6f030c2 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2595,7 +2595,7 @@ again:
> ret = btrfs_inc_extent_ref(trans, root, new->bytenr,
> new->disk_len, 0,
> backref->root_id, backref->inum,
> - new->file_pos, 0); /* start - extent_offset */
> + new->file_pos); /* start - extent_offset */
> if (ret) {
> btrfs_abort_transaction(trans, root, ret);
> goto out_free_path;
> @@ -4541,7 +4541,7 @@ delete:
> ret = btrfs_free_extent(trans, root, extent_start,
> extent_num_bytes, 0,
> btrfs_header_owner(leaf),
> - ino, extent_offset, 0);
> + ino, extent_offset);
> BUG_ON(ret);
> if (btrfs_should_throttle_delayed_refs(trans, root))
> btrfs_async_run_delayed_refs(root,
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7ed033a..4df0f2b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3206,41 +3206,6 @@ out:
> return ret;
> }
>
> -/* Helper to check and see if this root currently has a ref on the given disk
> - * bytenr. If it does then we need to update the quota for this root. This
> - * doesn't do anything if quotas aren't enabled.
> - */
> -static int check_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> - u64 disko)
> -{
> - struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
> - struct ulist *roots;
> - struct ulist_iterator uiter;
> - struct ulist_node *root_node = NULL;
> - int ret;
> -
> - if (!root->fs_info->quota_enabled)
> - return 1;
> -
> - btrfs_get_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
> - ret = btrfs_find_all_roots(trans, root->fs_info, disko,
> - tree_mod_seq_elem.seq, &roots);
> - if (ret < 0)
> - goto out;
> - ret = 0;
> - ULIST_ITER_INIT(&uiter);
> - while ((root_node = ulist_next(roots, &uiter))) {
> - if (root_node->val == root->objectid) {
> - ret = 1;
> - break;
> - }
> - }
> - ulist_free(roots);
> -out:
> - btrfs_put_tree_mod_seq(root->fs_info, &tree_mod_seq_elem);
> - return ret;
> -}
> -
> static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
> struct inode *inode,
> u64 endoff,
> @@ -3499,9 +3464,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> u32 nritems;
> int slot;
> int ret;
> - int no_quota;
> const u64 len = olen_aligned;
> - u64 last_disko = 0;
> u64 last_dest_end = destoff;
>
> ret = -ENOMEM;
> @@ -3547,7 +3510,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>
> nritems = btrfs_header_nritems(path->nodes[0]);
> process_slot:
> - no_quota = 1;
> if (path->slots[0] >= nritems) {
> ret = btrfs_next_leaf(BTRFS_I(src)->root, path);
> if (ret < 0)
> @@ -3699,35 +3661,13 @@ process_slot:
> btrfs_set_file_extent_num_bytes(leaf, extent,
> datal);
>
> - /*
> - * We need to look up the roots that point at
> - * this bytenr and see if the new root does. If
> - * it does not we need to make sure we update
> - * quotas appropriately.
> - */
> - if (disko && root != BTRFS_I(src)->root &&
> - disko != last_disko) {
> - no_quota = check_ref(trans, root,
> - disko);
> - if (no_quota < 0) {
> - btrfs_abort_transaction(trans,
> - root,
> - ret);
> - btrfs_end_transaction(trans,
> - root);
> - ret = no_quota;
> - goto out;
> - }
> - }
> -
> if (disko) {
> inode_add_bytes(inode, datal);
> ret = btrfs_inc_extent_ref(trans, root,
> disko, diskl, 0,
> root->root_key.objectid,
> btrfs_ino(inode),
> - new_key.offset - datao,
> - no_quota);
> + new_key.offset - datao);
> if (ret) {
> btrfs_abort_transaction(trans,
> root,
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index a7dc456..b4ca545 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1716,7 +1716,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
> ret = btrfs_inc_extent_ref(trans, root, new_bytenr,
> num_bytes, parent,
> btrfs_header_owner(leaf),
> - key.objectid, key.offset, 1);
> + key.objectid, key.offset);
> if (ret) {
> btrfs_abort_transaction(trans, root, ret);
> break;
> @@ -1724,7 +1724,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>
> ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
> parent, btrfs_header_owner(leaf),
> - key.objectid, key.offset, 1);
> + key.objectid, key.offset);
> if (ret) {
> btrfs_abort_transaction(trans, root, ret);
> break;
> @@ -1900,23 +1900,21 @@ again:
>
> ret = btrfs_inc_extent_ref(trans, src, old_bytenr, blocksize,
> path->nodes[level]->start,
> - src->root_key.objectid, level - 1, 0,
> - 1);
> + src->root_key.objectid, level - 1, 0);
> BUG_ON(ret);
> ret = btrfs_inc_extent_ref(trans, dest, new_bytenr, blocksize,
> 0, dest->root_key.objectid, level - 1,
> - 0, 1);
> + 0);
> BUG_ON(ret);
>
> ret = btrfs_free_extent(trans, src, new_bytenr, blocksize,
> path->nodes[level]->start,
> - src->root_key.objectid, level - 1, 0,
> - 1);
> + src->root_key.objectid, level - 1, 0);
> BUG_ON(ret);
>
> ret = btrfs_free_extent(trans, dest, old_bytenr, blocksize,
> 0, dest->root_key.objectid, level - 1,
> - 0, 1);
> + 0);
> BUG_ON(ret);
>
> btrfs_unlock_up_safe(path, 0);
> @@ -2745,7 +2743,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
> node->eb->start, blocksize,
> upper->eb->start,
> btrfs_header_owner(upper->eb),
> - node->level, 0, 1);
> + node->level, 0);
> BUG_ON(ret);
>
> ret = btrfs_drop_subtree(trans, root, eb, upper->eb);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 1fffe88..323e12c 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -693,7 +693,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
> ret = btrfs_inc_extent_ref(trans, root,
> ins.objectid, ins.offset,
> 0, root->root_key.objectid,
> - key->objectid, offset, 0);
> + key->objectid, offset);
> if (ret)
> goto out;
> } else {
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2 v2] Btrfs: fix regression when running delayed references
2015-10-25 10:04 [PATCH 1/2 v2] Btrfs: fix regression when running delayed references fdmanana
2015-10-25 10:04 ` [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups fdmanana
2015-10-25 18:51 ` [PATCH 1/2 v3] Btrfs: fix regression when running delayed references fdmanana
@ 2015-10-26 9:22 ` Liu Bo
2015-10-26 9:27 ` Filipe Manana
2 siblings, 1 reply; 10+ messages in thread
From: Liu Bo @ 2015-10-26 9:22 UTC (permalink / raw)
To: fdmanana, linux-btrfs; +Cc: stephane_btrfs, Filipe Manana
On 10/25/2015 06:04 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
> [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
> [35908.065201] Stack:
> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
> [35908.065201] Call Trace:
> [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
> [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
> [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
> [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
> [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
> [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
> [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP <ffff88010c4cbb08>
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. The following
> steps are an example sequence leading to this issue:
>
> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>
> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
> bytenr A is created, with a value of 1 and a seq value of 0;
>
> 3) fs_info->tree_mod_seq is incremented to 1;
>
> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
> later returns the metadata extent associated to extent buffer A to
> the free space cache (the range is not pinned), because the extent
> buffer was created in the current transaction (N) and writeback never
> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
> in the extent buffer).
> This creates the delayed reference Ref2 for bytenr A, with a value
> of -1 and a seq value of 1;
>
> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
> because they have different sequence numbers (decided at
> add_delayed_ref_tail_merge());
>
> 6) fs_info->tree_mod_seq is incremented to 2;
>
> 7) Some task attempts to allocate a new extent buffer (done at
> extent-tree.c:find_free_extent()), but due to heavy fragmentation
> and running low on metadata space the clustered allocation fails
> and we fall back to unclustered allocation, which finds the
> extent at offset A, so a new extent buffer at offset A is allocated.
> This creates delayed reference Ref3 for bytenr A, with a value of -1
The value is 1 if we create an eb.
Others look good.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
thanks,
-liubo
> and a seq value of 2;
>
> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
> all have different seq values;
>
> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>
> 10) The delayed Ref1 is the first one being applied, which ends up
> creating an inline extent backref in the extent tree;
>
> 10) Next the delayed reference Ref3 is selected for execution, and not
> Ref2, because select_delayed_ref() always gives a preference for
> positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>
> 11) When running Ref3 we encounter alreay the inline extent backref
> in the extent tree at insert_inline_extent_backref(), which makes
> us hit the following BUG_ON:
>
> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>
> This is always true because owner corresponds to the level of the
> extent buffer/btree node in the btree.
>
> For the scenario described above we hit the BUG_ON because we never merge
> references that have different seq values.
>
> We used to do the merging before the 4.2 kernel, more specifically, before
> the commmits:
>
> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>
> This issue became more exposed after the following change that was added
> to 4.2 as well:
>
> cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>
> Which in turn fixed another regression by the two commits previously
> mentioned.
>
> So fix this by bringing back the delayed reference merge code, with the
> proper adaptations so that it operates against the new data structure
> (linked list vs old red black tree implementation).
>
> This issue was hit running fstest btrfs/063 in a loop. Several people have
> reported this issue in the mailing list when running on kernels 4.2+.
>
> Very special thanks to Stéphane Lesimple for helping debugging this issue
> and testing this fix on his multi terabyte filesystem (which took more
> than one day to balance alone, plus fsck, etc).
>
> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> Reported-by: Peter Becker <floyd.net@gmail.com>
> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Reported-by: Malte Schröder <malte@tnxip.de>
> Reported-by: Derek Dongray <derek@valedon.co.uk>
> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
> Cc: stable@vger.kernel.org # 4.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>
> V2: No code changes. Added Stéphane's Tested-by tag and made patch
> part of a 2 fixes series, to reflect the other remaining fix
> depends on this one.
>
> fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/extent-tree.c | 14 ++++++
> 2 files changed, 127 insertions(+)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index ac3e81d..4832943 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
> trans->delayed_ref_updates--;
> }
>
> +static bool merge_ref(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_root *delayed_refs,
> + struct btrfs_delayed_ref_head *head,
> + struct btrfs_delayed_ref_node *ref,
> + u64 seq)
> +{
> + struct btrfs_delayed_ref_node *next;
> + bool done = false;
> +
> + next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> + list);
> + while (!done && &next->list != &head->ref_list) {
> + int mod;
> + struct btrfs_delayed_ref_node *next2;
> +
> + next2 = list_next_entry(next, list);
> +
> + if (next == ref)
> + goto next;
> +
> + if (seq && next->seq >= seq)
> + goto next;
> +
> + if (next->type != ref->type || next->no_quota != ref->no_quota)
> + goto next;
> +
> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
> + btrfs_delayed_node_to_tree_ref(next),
> + ref->type))
> + goto next;
> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
> + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
> + btrfs_delayed_node_to_data_ref(next)))
> + goto next;
> +
> + if (ref->action == next->action) {
> + mod = next->ref_mod;
> + } else {
> + if (ref->ref_mod < next->ref_mod) {
> + swap(ref, next);
> + done = true;
> + }
> + mod = -next->ref_mod;
> + }
> +
> + drop_delayed_ref(trans, delayed_refs, head, next);
> + ref->ref_mod += mod;
> + if (ref->ref_mod == 0) {
> + drop_delayed_ref(trans, delayed_refs, head, ref);
> + done = true;
> + } else {
> + /*
> + * Can't have multiples of the same ref on a tree block.
> + */
> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
> + }
> +next:
> + next = next2;
> + }
> +
> + return done;
> +}
> +
> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *fs_info,
> + struct btrfs_delayed_ref_root *delayed_refs,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_ref_node *ref;
> + u64 seq = 0;
> +
> + assert_spin_locked(&head->lock);
> +
> + if (list_empty(&head->ref_list))
> + return;
> +
> + /* We don't have too many refs to merge for data. */
> + if (head->is_data)
> + return;
> +
> + spin_lock(&fs_info->tree_mod_seq_lock);
> + if (!list_empty(&fs_info->tree_mod_seq_list)) {
> + struct seq_list *elem;
> +
> + elem = list_first_entry(&fs_info->tree_mod_seq_list,
> + struct seq_list, list);
> + seq = elem->seq;
> + }
> + spin_unlock(&fs_info->tree_mod_seq_lock);
> +
> + ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> + list);
> + while (&ref->list != &head->ref_list) {
> + if (seq && ref->seq >= seq)
> + goto next;
> +
> + if (merge_ref(trans, delayed_refs, head, ref, seq)) {
> + if (list_empty(&head->ref_list))
> + break;
> + ref = list_first_entry(&head->ref_list,
> + struct btrfs_delayed_ref_node,
> + list);
> + continue;
> + }
> +next:
> + ref = list_next_entry(ref, list);
> + }
> +}
> +
> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_root *delayed_refs,
> u64 seq)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2df4bc7..dd5756c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
> }
> }
>
> + /*
> + * We need to try and merge add/drops of the same ref since we
> + * can run into issues with relocate dropping the implicit ref
> + * and then it being added back again before the drop can
> + * finish. If we merged anything we need to re-loop so we can
> + * get a good ref.
> + * Or we can get node references of the same type that weren't
> + * merged when created due to bumps in the tree mod seq, and
> + * we need to merge them to prevent adding an inline extent
> + * backref before dropping it (triggering a BUG_ON at
> + * insert_inline_extent_backref()).
> + */
> spin_lock(&locked_ref->lock);
> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
> + locked_ref);
>
> /*
> * locked_ref is the head node, so we have to go one
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2 v2] Btrfs: fix regression when running delayed references
2015-10-26 9:22 ` [PATCH 1/2 v2] " Liu Bo
@ 2015-10-26 9:27 ` Filipe Manana
0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2015-10-26 9:27 UTC (permalink / raw)
To: Liu Bo; +Cc: linux-btrfs@vger.kernel.org, Stéphane Lesimple,
Filipe Manana
On Mon, Oct 26, 2015 at 9:22 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On 10/25/2015 06:04 PM, fdmanana@kernel.org wrote:
>>
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>> references implementation in order to fix certain problems with qgroups.
>> However that rework introduced one more regression that leads to the
>> following trace when running delayed references for metadata:
>>
>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic
>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache
>> sunrpc loop fuse parport_pc psmouse i2
>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W
>> 4.3.0-rc5-btrfs-next-17+ #1
>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>> [btrfs]
>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>> ffff88010c4c8000
>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>]
>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>> 0000000000000000
>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>> 0000000000000000
>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>> ffff88010c4cb9f8
>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>> 0000000000000000
>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>> 0000000000000000
>> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000)
>> knlGS:0000000000000000
>> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>> 00000000000006e0
>> [35908.065201] Stack:
>> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>> ffff88015a408000
>> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005
>> ffff88010c4cbbd8
>> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000
>> 0000000000000000
>> [35908.065201] Call Trace:
>> [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208
>> [btrfs]
>> [35908.065201] [<ffffffffa0497117>] ?
>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>> [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33
>> [btrfs]
>> [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f
>> [btrfs]
>> [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f
>> [btrfs]
>> [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd
>> [btrfs]
>> [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b
>> [btrfs]
>> [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>> [btrfs]
>> [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14
>> [btrfs]
>> [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>> [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48
>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02
>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>> [35908.065201] RIP [<ffffffffa04928b5>]
>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP <ffff88010c4cbb08>
>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>
>> This happens because the new delayed references code no longer merges
>> delayed references that have different sequence values. The following
>> steps are an example sequence leading to this issue:
>>
>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>
>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>> bytenr A is created, with a value of 1 and a seq value of 0;
>>
>> 3) fs_info->tree_mod_seq is incremented to 1;
>>
>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>> later returns the metadata extent associated to extent buffer A to
>> the free space cache (the range is not pinned), because the extent
>> buffer was created in the current transaction (N) and writeback never
>> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>> in the extent buffer).
>> This creates the delayed reference Ref2 for bytenr A, with a value
>> of -1 and a seq value of 1;
>>
>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>> because they have different sequence numbers (decided at
>> add_delayed_ref_tail_merge());
>>
>> 6) fs_info->tree_mod_seq is incremented to 2;
>>
>> 7) Some task attempts to allocate a new extent buffer (done at
>> extent-tree.c:find_free_extent()), but due to heavy fragmentation
>> and running low on metadata space the clustered allocation fails
>> and we fall back to unclustered allocation, which finds the
>> extent at offset A, so a new extent buffer at offset A is allocated.
>> This creates delayed reference Ref3 for bytenr A, with a value of -1
>
>
> The value is 1 if we create an eb.
Indeed, typo or copy paste error.
Fixed in my repo.
thanks
>
> Others look good.
>
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>
> thanks,
>
> -liubo
>
>
>> and a seq value of 2;
>>
>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>> all have different seq values;
>>
>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>
>> 10) The delayed Ref1 is the first one being applied, which ends up
>> creating an inline extent backref in the extent tree;
>>
>> 10) Next the delayed reference Ref3 is selected for execution, and not
>> Ref2, because select_delayed_ref() always gives a preference for
>> positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>
>> 11) When running Ref3 we encounter alreay the inline extent backref
>> in the extent tree at insert_inline_extent_backref(), which makes
>> us hit the following BUG_ON:
>>
>> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>
>> This is always true because owner corresponds to the level of the
>> extent buffer/btree node in the btree.
>>
>> For the scenario described above we hit the BUG_ON because we never merge
>> references that have different seq values.
>>
>> We used to do the merging before the 4.2 kernel, more specifically, before
>> the commmits:
>>
>> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in
>> ref_head.")
>> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>
>> This issue became more exposed after the following change that was added
>> to 4.2 as well:
>>
>> cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>
>> Which in turn fixed another regression by the two commits previously
>> mentioned.
>>
>> So fix this by bringing back the delayed reference merge code, with the
>> proper adaptations so that it operates against the new data structure
>> (linked list vs old red black tree implementation).
>>
>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>> reported this issue in the mailing list when running on kernels 4.2+.
>>
>> Very special thanks to Stéphane Lesimple for helping debugging this issue
>> and testing this fix on his multi terabyte filesystem (which took more
>> than one day to balance alone, plus fsck, etc).
>>
>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>> in ref_head.")
>> Reported-by: Peter Becker <floyd.net@gmail.com>
>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>> Reported-by: Malte Schröder <malte@tnxip.de>
>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>> Cc: stable@vger.kernel.org # 4.2+
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>
>> V2: No code changes. Added Stéphane's Tested-by tag and made patch
>> part of a 2 fixes series, to reflect the other remaining fix
>> depends on this one.
>>
>> fs/btrfs/delayed-ref.c | 113
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/extent-tree.c | 14 ++++++
>> 2 files changed, 127 insertions(+)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index ac3e81d..4832943 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>> btrfs_trans_handle *trans,
>> trans->delayed_ref_updates--;
>> }
>>
>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>> + struct btrfs_delayed_ref_root *delayed_refs,
>> + struct btrfs_delayed_ref_head *head,
>> + struct btrfs_delayed_ref_node *ref,
>> + u64 seq)
>> +{
>> + struct btrfs_delayed_ref_node *next;
>> + bool done = false;
>> +
>> + next = list_first_entry(&head->ref_list, struct
>> btrfs_delayed_ref_node,
>> + list);
>> + while (!done && &next->list != &head->ref_list) {
>> + int mod;
>> + struct btrfs_delayed_ref_node *next2;
>> +
>> + next2 = list_next_entry(next, list);
>> +
>> + if (next == ref)
>> + goto next;
>> +
>> + if (seq && next->seq >= seq)
>> + goto next;
>> +
>> + if (next->type != ref->type || next->no_quota !=
>> ref->no_quota)
>> + goto next;
>> +
>> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>> + btrfs_delayed_node_to_tree_ref(next),
>> + ref->type))
>> + goto next;
>> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>> + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>> + btrfs_delayed_node_to_data_ref(next)))
>> + goto next;
>> +
>> + if (ref->action == next->action) {
>> + mod = next->ref_mod;
>> + } else {
>> + if (ref->ref_mod < next->ref_mod) {
>> + swap(ref, next);
>> + done = true;
>> + }
>> + mod = -next->ref_mod;
>> + }
>> +
>> + drop_delayed_ref(trans, delayed_refs, head, next);
>> + ref->ref_mod += mod;
>> + if (ref->ref_mod == 0) {
>> + drop_delayed_ref(trans, delayed_refs, head, ref);
>> + done = true;
>> + } else {
>> + /*
>> + * Can't have multiples of the same ref on a tree
>> block.
>> + */
>> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>> + }
>> +next:
>> + next = next2;
>> + }
>> +
>> + return done;
>> +}
>> +
>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>> + struct btrfs_fs_info *fs_info,
>> + struct btrfs_delayed_ref_root *delayed_refs,
>> + struct btrfs_delayed_ref_head *head)
>> +{
>> + struct btrfs_delayed_ref_node *ref;
>> + u64 seq = 0;
>> +
>> + assert_spin_locked(&head->lock);
>> +
>> + if (list_empty(&head->ref_list))
>> + return;
>> +
>> + /* We don't have too many refs to merge for data. */
>> + if (head->is_data)
>> + return;
>> +
>> + spin_lock(&fs_info->tree_mod_seq_lock);
>> + if (!list_empty(&fs_info->tree_mod_seq_list)) {
>> + struct seq_list *elem;
>> +
>> + elem = list_first_entry(&fs_info->tree_mod_seq_list,
>> + struct seq_list, list);
>> + seq = elem->seq;
>> + }
>> + spin_unlock(&fs_info->tree_mod_seq_lock);
>> +
>> + ref = list_first_entry(&head->ref_list, struct
>> btrfs_delayed_ref_node,
>> + list);
>> + while (&ref->list != &head->ref_list) {
>> + if (seq && ref->seq >= seq)
>> + goto next;
>> +
>> + if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>> + if (list_empty(&head->ref_list))
>> + break;
>> + ref = list_first_entry(&head->ref_list,
>> + struct
>> btrfs_delayed_ref_node,
>> + list);
>> + continue;
>> + }
>> +next:
>> + ref = list_next_entry(ref, list);
>> + }
>> +}
>> +
>> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>> struct btrfs_delayed_ref_root *delayed_refs,
>> u64 seq)
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2df4bc7..dd5756c 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct
>> btrfs_trans_handle *trans,
>> }
>> }
>>
>> + /*
>> + * We need to try and merge add/drops of the same ref
>> since we
>> + * can run into issues with relocate dropping the implicit
>> ref
>> + * and then it being added back again before the drop can
>> + * finish. If we merged anything we need to re-loop so we
>> can
>> + * get a good ref.
>> + * Or we can get node references of the same type that
>> weren't
>> + * merged when created due to bumps in the tree mod seq,
>> and
>> + * we need to merge them to prevent adding an inline
>> extent
>> + * backref before dropping it (triggering a BUG_ON at
>> + * insert_inline_extent_backref()).
>> + */
>> spin_lock(&locked_ref->lock);
>> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>> + locked_ref);
>>
>> /*
>> * locked_ref is the head node, so we have to go one
>>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
2015-10-25 18:51 ` [PATCH 1/2 v3] Btrfs: fix regression when running delayed references fdmanana
@ 2015-12-13 10:51 ` Alex Lyakas
2015-12-13 15:43 ` Filipe Manana
0 siblings, 1 reply; 10+ messages in thread
From: Alex Lyakas @ 2015-12-13 10:51 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs, Filipe Manana
Hi Filipe Manana,
My understanding of selecting delayed refs to run or merging them is
far from complete. Can you please explain what will happen in the
following scenario:
1) Ref1 is created, as you explain
2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
up with an EXTENT_ITEM and an inline extent back ref
3) Ref2 and Ref3 are added
4) Somebody calls __btrfs_run_delayed_refs()
At this point, we cannot merge Ref2 and Ref3, because they might be
referencing tree blocks of completely different trees, thus
comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
run, because we prefer BTRFS_ADD_DELAYED_REF over
BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
now, because we already have Ref1 in the extent tree.
So something should prevent us from running Ref3 before running Ref2.
We should run Ref2 first, which should get rid of the EXTENT_ITEM and
the inline backref, and then run Ref3 to create a new backref with a
proper owner. What is that something?
Can you please point me at what am I missing?
Also, can such scenario happen in 3.18 kernel, which still has an
rbtree per ref-head? Looking at the code, I don't see anything
preventing that from happening.
Thanks,
Alex.
On Sun, Oct 25, 2015 at 8:51 PM, <fdmanana@kernel.org> wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
> [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
> [35908.065201] Stack:
> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
> [35908.065201] Call Trace:
> [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
> [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
> [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
> [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
> [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
> [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
> [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP <ffff88010c4cbb08>
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. The following
> steps are an example sequence leading to this issue:
>
> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>
> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
> bytenr A is created, with a value of 1 and a seq value of 0;
What happens if Ref1 is run at this point?
>
> 3) fs_info->tree_mod_seq is incremented to 1;
>
> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
> later returns the metadata extent associated to extent buffer A to
> the free space cache (the range is not pinned), because the extent
> buffer was created in the current transaction (N) and writeback never
> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
> in the extent buffer).
> This creates the delayed reference Ref2 for bytenr A, with a value
> of -1 and a seq value of 1;
>
> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
> because they have different sequence numbers (decided at
> add_delayed_ref_tail_merge());
>
> 6) fs_info->tree_mod_seq is incremented to 2;
>
> 7) Some task attempts to allocate a new extent buffer (done at
> extent-tree.c:find_free_extent()), but due to heavy fragmentation
> and running low on metadata space the clustered allocation fails
> and we fall back to unclustered allocation, which finds the
> extent at offset A, so a new extent buffer at offset A is allocated.
> This creates delayed reference Ref3 for bytenr A, with a value of -1
> and a seq value of 2;
>
> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
> all have different seq values;
Can Ref3 be merged with Ref1? They might be referencing blocks of
completely different trees (although using the same bytenr). So they
will have different root/parent and thus cannot be merged, I think.
>
> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>
> 10) The delayed Ref1 is the first one being applied, which ends up
> creating an inline extent backref in the extent tree;
>
> 10) Next the delayed reference Ref3 is selected for execution, and not
> Ref2, because select_delayed_ref() always gives a preference for
> positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>
> 11) When running Ref3 we encounter alreay the inline extent backref
> in the extent tree at insert_inline_extent_backref(), which makes
> us hit the following BUG_ON:
>
> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>
> This is always true because owner corresponds to the level of the
> extent buffer/btree node in the btree.
>
> For the scenario described above we hit the BUG_ON because we never merge
> references that have different seq values.
>
> We used to do the merging before the 4.2 kernel, more specifically, before
> the commmits:
>
> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>
> This issue became more exposed after the following change that was added
> to 4.2 as well:
>
> cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>
> Which in turn fixed another regression by the two commits previously
> mentioned.
>
> So fix this by bringing back the delayed reference merge code, with the
> proper adaptations so that it operates against the new data structure
> (linked list vs old red black tree implementation).
>
> This issue was hit running fstest btrfs/063 in a loop. Several people have
> reported this issue in the mailing list when running on kernels 4.2+.
>
> Very special thanks to Stéphane Lesimple for helping debugging this issue
> and testing this fix on his multi terabyte filesystem (which took more
> than one day to balance alone, plus fsck, etc).
>
> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> Reported-by: Peter Becker <floyd.net@gmail.com>
> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Reported-by: Malte Schröder <malte@tnxip.de>
> Reported-by: Derek Dongray <derek@valedon.co.uk>
> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
> Cc: stable@vger.kernel.org # 4.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>
> V2: No code changes. Added Stéphane's Tested-by tag and made patch
> part of a 2 fixes series, to reflect the other remaining fix
> depends on this one.
> V3: No changes. Updated only the 2nd patch in the series to v3 (change log
> updated).
>
> fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/extent-tree.c | 14 ++++++
> 2 files changed, 127 insertions(+)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index bd9b63b..2f41580 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
> trans->delayed_ref_updates--;
> }
>
> +static bool merge_ref(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_root *delayed_refs,
> + struct btrfs_delayed_ref_head *head,
> + struct btrfs_delayed_ref_node *ref,
> + u64 seq)
> +{
> + struct btrfs_delayed_ref_node *next;
> + bool done = false;
> +
> + next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> + list);
> + while (!done && &next->list != &head->ref_list) {
> + int mod;
> + struct btrfs_delayed_ref_node *next2;
> +
> + next2 = list_next_entry(next, list);
> +
> + if (next == ref)
> + goto next;
> +
> + if (seq && next->seq >= seq)
> + goto next;
> +
> + if (next->type != ref->type || next->no_quota != ref->no_quota)
> + goto next;
> +
> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
> + btrfs_delayed_node_to_tree_ref(next),
> + ref->type))
> + goto next;
> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
> + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
> + btrfs_delayed_node_to_data_ref(next)))
> + goto next;
> +
> + if (ref->action == next->action) {
> + mod = next->ref_mod;
> + } else {
> + if (ref->ref_mod < next->ref_mod) {
> + swap(ref, next);
> + done = true;
> + }
> + mod = -next->ref_mod;
> + }
> +
> + drop_delayed_ref(trans, delayed_refs, head, next);
> + ref->ref_mod += mod;
> + if (ref->ref_mod == 0) {
> + drop_delayed_ref(trans, delayed_refs, head, ref);
> + done = true;
> + } else {
> + /*
> + * Can't have multiples of the same ref on a tree block.
> + */
> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
> + }
> +next:
> + next = next2;
> + }
> +
> + return done;
> +}
> +
> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
> + struct btrfs_fs_info *fs_info,
> + struct btrfs_delayed_ref_root *delayed_refs,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_ref_node *ref;
> + u64 seq = 0;
> +
> + assert_spin_locked(&head->lock);
> +
> + if (list_empty(&head->ref_list))
> + return;
> +
> + /* We don't have too many refs to merge for data. */
> + if (head->is_data)
> + return;
> +
> + spin_lock(&fs_info->tree_mod_seq_lock);
> + if (!list_empty(&fs_info->tree_mod_seq_list)) {
> + struct seq_list *elem;
> +
> + elem = list_first_entry(&fs_info->tree_mod_seq_list,
> + struct seq_list, list);
> + seq = elem->seq;
> + }
> + spin_unlock(&fs_info->tree_mod_seq_lock);
> +
> + ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> + list);
> + while (&ref->list != &head->ref_list) {
> + if (seq && ref->seq >= seq)
> + goto next;
> +
> + if (merge_ref(trans, delayed_refs, head, ref, seq)) {
> + if (list_empty(&head->ref_list))
> + break;
> + ref = list_first_entry(&head->ref_list,
> + struct btrfs_delayed_ref_node,
> + list);
> + continue;
> + }
> +next:
> + ref = list_next_entry(ref, list);
> + }
> +}
> +
> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_root *delayed_refs,
> u64 seq)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 92fdbc6..c0f30f5 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2502,7 +2502,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
> }
> }
>
> + /*
> + * We need to try and merge add/drops of the same ref since we
> + * can run into issues with relocate dropping the implicit ref
> + * and then it being added back again before the drop can
> + * finish. If we merged anything we need to re-loop so we can
> + * get a good ref.
> + * Or we can get node references of the same type that weren't
> + * merged when created due to bumps in the tree mod seq, and
> + * we need to merge them to prevent adding an inline extent
> + * backref before dropping it (triggering a BUG_ON at
> + * insert_inline_extent_backref()).
> + */
> spin_lock(&locked_ref->lock);
> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
> + locked_ref);
>
> /*
> * locked_ref is the head node, so we have to go one
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
2015-12-13 10:51 ` Alex Lyakas
@ 2015-12-13 15:43 ` Filipe Manana
2015-12-13 16:09 ` Alex Lyakas
0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2015-12-13 15:43 UTC (permalink / raw)
To: Alex Lyakas; +Cc: linux-btrfs, Filipe Manana
On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Hi Filipe Manana,
>
> My understanding of selecting delayed refs to run or merging them is
> far from complete. Can you please explain what will happen in the
> following scenario:
>
> 1) Ref1 is created, as you explain
> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
> up with an EXTENT_ITEM and an inline extent back ref
> 3) Ref2 and Ref3 are added
> 4) Somebody calls __btrfs_run_delayed_refs()
>
> At this point, we cannot merge Ref2 and Ref3, because they might be
> referencing tree blocks of completely different trees, thus
> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
> run, because we prefer BTRFS_ADD_DELAYED_REF over
> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
> now, because we already have Ref1 in the extent tree.
No, that won't happen. If the ref (Ref3) is for a different tree, than
it has a different inline extent from Ref1
(lookup_inline_extent_backref returns -ENOENT and not 0).
If they are all for the same tree it means Ref3 is not merged with
Ref2 because they have different seq numbers and a seq value exist in
fs_info->tree_mod_seq_list, and we skip Ref3 through
btrfs_check_delayed_seq() until such seq number goes away from
tree_mod_seq_list. If no seq number exists in tree_mod_seq_list then
we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
running delayed refs, with Ref2 (which removes both refs since one is
"-1" and the other "+1").
Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
> So something should prevent us from running Ref3 before running Ref2.
> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
> the inline backref, and then run Ref3 to create a new backref with a
> proper owner. What is that something?
>
> Can you please point me at what am I missing?
>
> Also, can such scenario happen in 3.18 kernel, which still has an
> rbtree per ref-head? Looking at the code, I don't see anything
> preventing that from happening.
>
> Thanks,
> Alex.
>
>
> On Sun, Oct 25, 2015 at 8:51 PM, <fdmanana@kernel.org> wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>> references implementation in order to fix certain problems with qgroups.
>> However that rework introduced one more regression that leads to the
>> following trace when running delayed references for metadata:
>>
>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
>> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
>> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
>> [35908.065201] Stack:
>> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
>> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
>> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
>> [35908.065201] Call Trace:
>> [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
>> [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>> [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
>> [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
>> [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
>> [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
>> [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
>> [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
>> [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
>> [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>> [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>> [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP <ffff88010c4cbb08>
>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>
>> This happens because the new delayed references code no longer merges
>> delayed references that have different sequence values. The following
>> steps are an example sequence leading to this issue:
>>
>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>
>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>> bytenr A is created, with a value of 1 and a seq value of 0;
>
> What happens if Ref1 is run at this point?
>
>>
>> 3) fs_info->tree_mod_seq is incremented to 1;
>>
>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>> later returns the metadata extent associated to extent buffer A to
>> the free space cache (the range is not pinned), because the extent
>> buffer was created in the current transaction (N) and writeback never
>> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>> in the extent buffer).
>> This creates the delayed reference Ref2 for bytenr A, with a value
>> of -1 and a seq value of 1;
>>
>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>> because they have different sequence numbers (decided at
>> add_delayed_ref_tail_merge());
>>
>> 6) fs_info->tree_mod_seq is incremented to 2;
>>
>> 7) Some task attempts to allocate a new extent buffer (done at
>> extent-tree.c:find_free_extent()), but due to heavy fragmentation
>> and running low on metadata space the clustered allocation fails
>> and we fall back to unclustered allocation, which finds the
>> extent at offset A, so a new extent buffer at offset A is allocated.
>> This creates delayed reference Ref3 for bytenr A, with a value of -1
>> and a seq value of 2;
>>
>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>> all have different seq values;
> Can Ref3 be merged with Ref1? They might be referencing blocks of
> completely different trees (although using the same bytenr). So they
> will have different root/parent and thus cannot be merged, I think.
>
>>
>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>
>> 10) The delayed Ref1 is the first one being applied, which ends up
>> creating an inline extent backref in the extent tree;
>>
>> 10) Next the delayed reference Ref3 is selected for execution, and not
>> Ref2, because select_delayed_ref() always gives a preference for
>> positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>
>> 11) When running Ref3 we encounter alreay the inline extent backref
>> in the extent tree at insert_inline_extent_backref(), which makes
>> us hit the following BUG_ON:
>>
>> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>
>> This is always true because owner corresponds to the level of the
>> extent buffer/btree node in the btree.
>>
>> For the scenario described above we hit the BUG_ON because we never merge
>> references that have different seq values.
>>
>> We used to do the merging before the 4.2 kernel, more specifically, before
>> the commmits:
>>
>> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
>> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>
>> This issue became more exposed after the following change that was added
>> to 4.2 as well:
>>
>> cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>
>> Which in turn fixed another regression by the two commits previously
>> mentioned.
>>
>> So fix this by bringing back the delayed reference merge code, with the
>> proper adaptations so that it operates against the new data structure
>> (linked list vs old red black tree implementation).
>>
>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>> reported this issue in the mailing list when running on kernels 4.2+.
>>
>> Very special thanks to Stéphane Lesimple for helping debugging this issue
>> and testing this fix on his multi terabyte filesystem (which took more
>> than one day to balance alone, plus fsck, etc).
>>
>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
>> Reported-by: Peter Becker <floyd.net@gmail.com>
>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>> Reported-by: Malte Schröder <malte@tnxip.de>
>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>> Cc: stable@vger.kernel.org # 4.2+
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>
>> V2: No code changes. Added Stéphane's Tested-by tag and made patch
>> part of a 2 fixes series, to reflect the other remaining fix
>> depends on this one.
>> V3: No changes. Updated only the 2nd patch in the series to v3 (change log
>> updated).
>>
>> fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/extent-tree.c | 14 ++++++
>> 2 files changed, 127 insertions(+)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index bd9b63b..2f41580 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>> trans->delayed_ref_updates--;
>> }
>>
>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>> + struct btrfs_delayed_ref_root *delayed_refs,
>> + struct btrfs_delayed_ref_head *head,
>> + struct btrfs_delayed_ref_node *ref,
>> + u64 seq)
>> +{
>> + struct btrfs_delayed_ref_node *next;
>> + bool done = false;
>> +
>> + next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
>> + list);
>> + while (!done && &next->list != &head->ref_list) {
>> + int mod;
>> + struct btrfs_delayed_ref_node *next2;
>> +
>> + next2 = list_next_entry(next, list);
>> +
>> + if (next == ref)
>> + goto next;
>> +
>> + if (seq && next->seq >= seq)
>> + goto next;
>> +
>> + if (next->type != ref->type || next->no_quota != ref->no_quota)
>> + goto next;
>> +
>> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>> + btrfs_delayed_node_to_tree_ref(next),
>> + ref->type))
>> + goto next;
>> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>> + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>> + btrfs_delayed_node_to_data_ref(next)))
>> + goto next;
>> +
>> + if (ref->action == next->action) {
>> + mod = next->ref_mod;
>> + } else {
>> + if (ref->ref_mod < next->ref_mod) {
>> + swap(ref, next);
>> + done = true;
>> + }
>> + mod = -next->ref_mod;
>> + }
>> +
>> + drop_delayed_ref(trans, delayed_refs, head, next);
>> + ref->ref_mod += mod;
>> + if (ref->ref_mod == 0) {
>> + drop_delayed_ref(trans, delayed_refs, head, ref);
>> + done = true;
>> + } else {
>> + /*
>> + * Can't have multiples of the same ref on a tree block.
>> + */
>> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>> + }
>> +next:
>> + next = next2;
>> + }
>> +
>> + return done;
>> +}
>> +
>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>> + struct btrfs_fs_info *fs_info,
>> + struct btrfs_delayed_ref_root *delayed_refs,
>> + struct btrfs_delayed_ref_head *head)
>> +{
>> + struct btrfs_delayed_ref_node *ref;
>> + u64 seq = 0;
>> +
>> + assert_spin_locked(&head->lock);
>> +
>> + if (list_empty(&head->ref_list))
>> + return;
>> +
>> + /* We don't have too many refs to merge for data. */
>> + if (head->is_data)
>> + return;
>> +
>> + spin_lock(&fs_info->tree_mod_seq_lock);
>> + if (!list_empty(&fs_info->tree_mod_seq_list)) {
>> + struct seq_list *elem;
>> +
>> + elem = list_first_entry(&fs_info->tree_mod_seq_list,
>> + struct seq_list, list);
>> + seq = elem->seq;
>> + }
>> + spin_unlock(&fs_info->tree_mod_seq_lock);
>> +
>> + ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
>> + list);
>> + while (&ref->list != &head->ref_list) {
>> + if (seq && ref->seq >= seq)
>> + goto next;
>> +
>> + if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>> + if (list_empty(&head->ref_list))
>> + break;
>> + ref = list_first_entry(&head->ref_list,
>> + struct btrfs_delayed_ref_node,
>> + list);
>> + continue;
>> + }
>> +next:
>> + ref = list_next_entry(ref, list);
>> + }
>> +}
>> +
>> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>> struct btrfs_delayed_ref_root *delayed_refs,
>> u64 seq)
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 92fdbc6..c0f30f5 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2502,7 +2502,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>> }
>> }
>>
>> + /*
>> + * We need to try and merge add/drops of the same ref since we
>> + * can run into issues with relocate dropping the implicit ref
>> + * and then it being added back again before the drop can
>> + * finish. If we merged anything we need to re-loop so we can
>> + * get a good ref.
>> + * Or we can get node references of the same type that weren't
>> + * merged when created due to bumps in the tree mod seq, and
>> + * we need to merge them to prevent adding an inline extent
>> + * backref before dropping it (triggering a BUG_ON at
>> + * insert_inline_extent_backref()).
>> + */
>> spin_lock(&locked_ref->lock);
>> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>> + locked_ref);
>>
>> /*
>> * locked_ref is the head node, so we have to go one
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
2015-12-13 15:43 ` Filipe Manana
@ 2015-12-13 16:09 ` Alex Lyakas
0 siblings, 0 replies; 10+ messages in thread
From: Alex Lyakas @ 2015-12-13 16:09 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, Filipe Manana
Hi Filipe,
Thank you for the explanation.
On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana <fdmanana@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, than
> it has a different inline extent from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).
Understood. So in this case, we will first add inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.
>
> If they are all for the same tree it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exist in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until such seq number goes away from
> tree_mod_seq_list.
Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.
> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").
So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1. Because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.
Thanks!
Alex.
>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point me at what am I missing?
>>
>> Also, can such scenario happen in 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM, <fdmanana@kernel.org> wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
>>> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
>>> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
>>> [35908.065201] Stack:
>>> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
>>> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
>>> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
>>> [35908.065201] Call Trace:
>>> [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
>>> [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>> [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
>>> [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
>>> [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
>>> [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
>>> [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
>>> [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
>>> [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
>>> [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>> [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
>>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>> [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>> [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP <ffff88010c4cbb08>
>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>
>>> This happens because the new delayed references code no longer merges
>>> delayed references that have different sequence values. The following
>>> steps are an example sequence leading to this issue:
>>>
>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>
>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>>> bytenr A is created, with a value of 1 and a seq value of 0;
>>
>> What happens if Ref1 is run at this point?
>>
>>>
>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>
>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>> later returns the metadata extent associated to extent buffer A to
>>> the free space cache (the range is not pinned), because the extent
>>> buffer was created in the current transaction (N) and writeback never
>>> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>>> in the extent buffer).
>>> This creates the delayed reference Ref2 for bytenr A, with a value
>>> of -1 and a seq value of 1;
>>>
>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>> because they have different sequence numbers (decided at
>>> add_delayed_ref_tail_merge());
>>>
>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>
>>> 7) Some task attempts to allocate a new extent buffer (done at
>>> extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>> and running low on metadata space the clustered allocation fails
>>> and we fall back to unclustered allocation, which finds the
>>> extent at offset A, so a new extent buffer at offset A is allocated.
>>> This creates delayed reference Ref3 for bytenr A, with a value of -1
>>> and a seq value of 2;
>>>
>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>> all have different seq values;
>> Can Ref3 be merged with Ref1? They might be referencing blocks of
>> completely different trees (although using the same bytenr). So they
>> will have different root/parent and thus cannot be merged, I think.
>>
>>>
>>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>>
>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>> creating an inline extent backref in the extent tree;
>>>
>>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>> Ref2, because select_delayed_ref() always gives a preference for
>>> positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>>
>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>> in the extent tree at insert_inline_extent_backref(), which makes
>>> us hit the following BUG_ON:
>>>
>>> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>
>>> This is always true because owner corresponds to the level of the
>>> extent buffer/btree node in the btree.
>>>
>>> For the scenario described above we hit the BUG_ON because we never merge
>>> references that have different seq values.
>>>
>>> We used to do the merging before the 4.2 kernel, more specifically, before
>>> the commmits:
>>>
>>> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
>>> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>>
>>> This issue became more exposed after the following change that was added
>>> to 4.2 as well:
>>>
>>> cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>>
>>> Which in turn fixed another regression by the two commits previously
>>> mentioned.
>>>
>>> So fix this by bringing back the delayed reference merge code, with the
>>> proper adaptations so that it operates against the new data structure
>>> (linked list vs old red black tree implementation).
>>>
>>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>>> reported this issue in the mailing list when running on kernels 4.2+.
>>>
>>> Very special thanks to Stéphane Lesimple for helping debugging this issue
>>> and testing this fix on his multi terabyte filesystem (which took more
>>> than one day to balance alone, plus fsck, etc).
>>>
>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>> Cc: stable@vger.kernel.org # 4.2+
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>
>>> V2: No code changes. Added Stéphane's Tested-by tag and made patch
>>> part of a 2 fixes series, to reflect the other remaining fix
>>> depends on this one.
>>> V3: No changes. Updated only the 2nd patch in the series to v3 (change log
>>> updated).
>>>
>>> fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> fs/btrfs/extent-tree.c | 14 ++++++
>>> 2 files changed, 127 insertions(+)
>>>
>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>> index bd9b63b..2f41580 100644
>>> --- a/fs/btrfs/delayed-ref.c
>>> +++ b/fs/btrfs/delayed-ref.c
>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>>> trans->delayed_ref_updates--;
>>> }
>>>
>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>> + struct btrfs_delayed_ref_root *delayed_refs,
>>> + struct btrfs_delayed_ref_head *head,
>>> + struct btrfs_delayed_ref_node *ref,
>>> + u64 seq)
>>> +{
>>> + struct btrfs_delayed_ref_node *next;
>>> + bool done = false;
>>> +
>>> + next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
>>> + list);
>>> + while (!done && &next->list != &head->ref_list) {
>>> + int mod;
>>> + struct btrfs_delayed_ref_node *next2;
>>> +
>>> + next2 = list_next_entry(next, list);
>>> +
>>> + if (next == ref)
>>> + goto next;
>>> +
>>> + if (seq && next->seq >= seq)
>>> + goto next;
>>> +
>>> + if (next->type != ref->type || next->no_quota != ref->no_quota)
>>> + goto next;
>>> +
>>> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>> + btrfs_delayed_node_to_tree_ref(next),
>>> + ref->type))
>>> + goto next;
>>> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>> + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>> + btrfs_delayed_node_to_data_ref(next)))
>>> + goto next;
>>> +
>>> + if (ref->action == next->action) {
>>> + mod = next->ref_mod;
>>> + } else {
>>> + if (ref->ref_mod < next->ref_mod) {
>>> + swap(ref, next);
>>> + done = true;
>>> + }
>>> + mod = -next->ref_mod;
>>> + }
>>> +
>>> + drop_delayed_ref(trans, delayed_refs, head, next);
>>> + ref->ref_mod += mod;
>>> + if (ref->ref_mod == 0) {
>>> + drop_delayed_ref(trans, delayed_refs, head, ref);
>>> + done = true;
>>> + } else {
>>> + /*
>>> + * Can't have multiples of the same ref on a tree block.
>>> + */
>>> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>>> + }
>>> +next:
>>> + next = next2;
>>> + }
>>> +
>>> + return done;
>>> +}
>>> +
>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>> + struct btrfs_fs_info *fs_info,
>>> + struct btrfs_delayed_ref_root *delayed_refs,
>>> + struct btrfs_delayed_ref_head *head)
>>> +{
>>> + struct btrfs_delayed_ref_node *ref;
>>> + u64 seq = 0;
>>> +
>>> + assert_spin_locked(&head->lock);
>>> +
>>> + if (list_empty(&head->ref_list))
>>> + return;
>>> +
>>> + /* We don't have too many refs to merge for data. */
>>> + if (head->is_data)
>>> + return;
>>> +
>>> + spin_lock(&fs_info->tree_mod_seq_lock);
>>> + if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>> + struct seq_list *elem;
>>> +
>>> + elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>> + struct seq_list, list);
>>> + seq = elem->seq;
>>> + }
>>> + spin_unlock(&fs_info->tree_mod_seq_lock);
>>> +
>>> + ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
>>> + list);
>>> + while (&ref->list != &head->ref_list) {
>>> + if (seq && ref->seq >= seq)
>>> + goto next;
>>> +
>>> + if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>>> + if (list_empty(&head->ref_list))
>>> + break;
>>> + ref = list_first_entry(&head->ref_list,
>>> + struct btrfs_delayed_ref_node,
>>> + list);
>>> + continue;
>>> + }
>>> +next:
>>> + ref = list_next_entry(ref, list);
>>> + }
>>> +}
>>> +
>>> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>> struct btrfs_delayed_ref_root *delayed_refs,
>>> u64 seq)
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 92fdbc6..c0f30f5 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2502,7 +2502,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>>> }
>>> }
>>>
>>> + /*
>>> + * We need to try and merge add/drops of the same ref since we
>>> + * can run into issues with relocate dropping the implicit ref
>>> + * and then it being added back again before the drop can
>>> + * finish. If we merged anything we need to re-loop so we can
>>> + * get a good ref.
>>> + * Or we can get node references of the same type that weren't
>>> + * merged when created due to bumps in the tree mod seq, and
>>> + * we need to merge them to prevent adding an inline extent
>>> + * backref before dropping it (triggering a BUG_ON at
>>> + * insert_inline_extent_backref()).
>>> + */
>>> spin_lock(&locked_ref->lock);
>>> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>>> + locked_ref);
>>>
>>> /*
>>> * locked_ref is the head node, so we have to go one
>>> --
>>> 2.1.3
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2015-12-13 16:09 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-25 10:04 [PATCH 1/2 v2] Btrfs: fix regression when running delayed references fdmanana
2015-10-25 10:04 ` [PATCH 2/2 v2] Btrfs: fix regression running delayed references when using qgroups fdmanana
2015-10-25 18:51 ` [PATCH 2/2 v3] " fdmanana
2015-10-26 8:32 ` Qu Wenruo
2015-10-25 18:51 ` [PATCH 1/2 v3] Btrfs: fix regression when running delayed references fdmanana
2015-12-13 10:51 ` Alex Lyakas
2015-12-13 15:43 ` Filipe Manana
2015-12-13 16:09 ` Alex Lyakas
2015-10-26 9:22 ` [PATCH 1/2 v2] " Liu Bo
2015-10-26 9:27 ` Filipe Manana
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).