From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.kernel.org ([198.145.29.136]:60799 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752531AbbJZJ1j convert rfc822-to-8bit (ORCPT ); Mon, 26 Oct 2015 05:27:39 -0400 Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 7741C207DA for ; Mon, 26 Oct 2015 09:27:37 +0000 (UTC) Received: from mail-io0-f175.google.com (mail-io0-f175.google.com [209.85.223.175]) (using TLSv1.2 with cipher AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id D0C9E207CE for ; Mon, 26 Oct 2015 09:27:34 +0000 (UTC) Received: by ioll68 with SMTP id l68so180657122iol.3 for ; Mon, 26 Oct 2015 02:27:34 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <562DF0D8.20502@oracle.com> References: <1445767453-16233-1-git-send-email-fdmanana@kernel.org> <562DF0D8.20502@oracle.com> Date: Mon, 26 Oct 2015 09:27:34 +0000 Message-ID: Subject: Re: [PATCH 1/2 v2] Btrfs: fix regression when running delayed references From: Filipe Manana To: Liu Bo Cc: "linux-btrfs@vger.kernel.org" , =?UTF-8?Q?St=C3=A9phane_Lesimple?= , Filipe Manana Content-Type: text/plain; charset=UTF-8 Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Mon, Oct 26, 2015 at 9:22 AM, Liu Bo wrote: > On 10/25/2015 06:04 PM, fdmanana@kernel.org wrote: >> >> From: Filipe Manana >> >> In the kernel 4.2 merge window we had a refactoring/rework of the delayed >> references implementation in order to fix certain problems with qgroups. >> However that rework introduced one more regression that leads to the >> following trace when running delayed references for metadata: >> >> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! >> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic >> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache >> sunrpc loop fuse parport_pc psmouse i2 >> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W >> 4.3.0-rc5-btrfs-next-17+ #1 >> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 >> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper >> [btrfs] >> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: >> ffff88010c4c8000 >> [35908.065201] RIP: 0010:[] [] >> insert_inline_extent_backref+0x52/0xb1 [btrfs] >> [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293 >> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: >> 0000000000000000 >> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: >> 0000000000000000 >> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: >> ffff88010c4cb9f8 >> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: >> 0000000000000000 >> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: >> 0000000000000000 >> [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) >> knlGS:0000000000000000 >> [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: >> 00000000000006e0 >> [35908.065201] Stack: >> [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 >> ffff88015a408000 >> [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 >> ffff88010c4cbbd8 >> [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 >> 0000000000000000 >> [35908.065201] Call Trace: >> [35908.065201] [] __btrfs_inc_extent_ref+0x8b/0x208 >> [btrfs] >> [35908.065201] [] ? >> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs] >> [35908.065201] [] __btrfs_run_delayed_refs+0xafa/0xd33 >> [btrfs] >> [35908.065201] [] ? join_transaction.isra.10+0x25/0x41f >> [btrfs] >> [35908.065201] [] ? join_transaction.isra.10+0xa8/0x41f >> [btrfs] >> [35908.065201] [] btrfs_run_delayed_refs+0x75/0x1dd >> [btrfs] >> [35908.065201] [] delayed_ref_async_start+0x3c/0x7b >> [btrfs] >> [35908.065201] [] normal_work_helper+0x14c/0x32a >> [btrfs] >> [35908.065201] [] btrfs_extent_refs_helper+0x12/0x14 >> [btrfs] >> [35908.065201] [] process_one_work+0x24a/0x4ac >> [35908.065201] [] worker_thread+0x206/0x2c2 >> [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb >> [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb >> [35908.065201] [] kthread+0xef/0xf7 >> [35908.065201] [] ? kthread_parkme+0x24/0x24 >> [35908.065201] [] ret_from_fork+0x3f/0x70 >> [35908.065201] [] ? kthread_parkme+0x24/0x24 >> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 >> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 >> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31 >> [35908.065201] RIP [] >> insert_inline_extent_backref+0x52/0xb1 [btrfs] >> [35908.065201] RSP >> [35908.310885] ---[ end trace fe4299baf0666457 ]--- >> >> This happens because the new delayed references code no longer merges >> delayed references that have different sequence values. The following >> steps are an example sequence leading to this issue: >> >> 1) Transaction N starts, fs_info->tree_mod_seq has value 0; >> >> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for >> bytenr A is created, with a value of 1 and a seq value of 0; >> >> 3) fs_info->tree_mod_seq is incremented to 1; >> >> 4) Extent buffer A is deleted through btrfs_del_items(), which calls >> btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The >> later returns the metadata extent associated to extent buffer A to >> the free space cache (the range is not pinned), because the extent >> buffer was created in the current transaction (N) and writeback never >> happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set >> in the extent buffer). >> This creates the delayed reference Ref2 for bytenr A, with a value >> of -1 and a seq value of 1; >> >> 5) Delayed reference Ref2 is not merged with Ref1 when we create it, >> because they have different sequence numbers (decided at >> add_delayed_ref_tail_merge()); >> >> 6) fs_info->tree_mod_seq is incremented to 2; >> >> 7) Some task attempts to allocate a new extent buffer (done at >> extent-tree.c:find_free_extent()), but due to heavy fragmentation >> and running low on metadata space the clustered allocation fails >> and we fall back to unclustered allocation, which finds the >> extent at offset A, so a new extent buffer at offset A is allocated. >> This creates delayed reference Ref3 for bytenr A, with a value of -1 > > > The value is 1 if we create an eb. Indeed, typo or copy paste error. Fixed in my repo. thanks > > Others look good. > > Reviewed-by: Liu Bo > > thanks, > > -liubo > > >> and a seq value of 2; >> >> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they >> all have different seq values; >> >> 9) We start running the delayed references (__btrfs_run_delayed_refs()); >> >> 10) The delayed Ref1 is the first one being applied, which ends up >> creating an inline extent backref in the extent tree; >> >> 10) Next the delayed reference Ref3 is selected for execution, and not >> Ref2, because select_delayed_ref() always gives a preference for >> positive references (that have an action of BTRFS_ADD_DELAYED_REF); >> >> 11) When running Ref3 we encounter alreay the inline extent backref >> in the extent tree at insert_inline_extent_backref(), which makes >> us hit the following BUG_ON: >> >> BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID); >> >> This is always true because owner corresponds to the level of the >> extent buffer/btree node in the btree. >> >> For the scenario described above we hit the BUG_ON because we never merge >> references that have different seq values. >> >> We used to do the merging before the 4.2 kernel, more specifically, before >> the commmits: >> >> c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in >> ref_head.") >> c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.") >> >> This issue became more exposed after the following change that was added >> to 4.2 as well: >> >> cffc3374e567 ("Btrfs: fix order by which delayed references are run") >> >> Which in turn fixed another regression by the two commits previously >> mentioned. >> >> So fix this by bringing back the delayed reference merge code, with the >> proper adaptations so that it operates against the new data structure >> (linked list vs old red black tree implementation). >> >> This issue was hit running fstest btrfs/063 in a loop. Several people have >> reported this issue in the mailing list when running on kernels 4.2+. >> >> Very special thanks to Stéphane Lesimple for helping debugging this issue >> and testing this fix on his multi terabyte filesystem (which took more >> than one day to balance alone, plus fsck, etc). >> >> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root >> in ref_head.") >> Reported-by: Peter Becker >> Reported-by: Stéphane Lesimple >> Tested-by: Stéphane Lesimple >> Reported-by: Malte Schröder >> Reported-by: Derek Dongray >> Reported-by: Erkki Seppala >> Cc: stable@vger.kernel.org # 4.2+ >> Signed-off-by: Filipe Manana >> --- >> >> V2: No code changes. Added Stéphane's Tested-by tag and made patch >> part of a 2 fixes series, to reflect the other remaining fix >> depends on this one. >> >> fs/btrfs/delayed-ref.c | 113 >> +++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/extent-tree.c | 14 ++++++ >> 2 files changed, 127 insertions(+) >> >> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c >> index ac3e81d..4832943 100644 >> --- a/fs/btrfs/delayed-ref.c >> +++ b/fs/btrfs/delayed-ref.c >> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct >> btrfs_trans_handle *trans, >> trans->delayed_ref_updates--; >> } >> >> +static bool merge_ref(struct btrfs_trans_handle *trans, >> + struct btrfs_delayed_ref_root *delayed_refs, >> + struct btrfs_delayed_ref_head *head, >> + struct btrfs_delayed_ref_node *ref, >> + u64 seq) >> +{ >> + struct btrfs_delayed_ref_node *next; >> + bool done = false; >> + >> + next = list_first_entry(&head->ref_list, struct >> btrfs_delayed_ref_node, >> + list); >> + while (!done && &next->list != &head->ref_list) { >> + int mod; >> + struct btrfs_delayed_ref_node *next2; >> + >> + next2 = list_next_entry(next, list); >> + >> + if (next == ref) >> + goto next; >> + >> + if (seq && next->seq >= seq) >> + goto next; >> + >> + if (next->type != ref->type || next->no_quota != >> ref->no_quota) >> + goto next; >> + >> + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY || >> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) && >> + comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref), >> + btrfs_delayed_node_to_tree_ref(next), >> + ref->type)) >> + goto next; >> + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY || >> + ref->type == BTRFS_SHARED_DATA_REF_KEY) && >> + comp_data_refs(btrfs_delayed_node_to_data_ref(ref), >> + btrfs_delayed_node_to_data_ref(next))) >> + goto next; >> + >> + if (ref->action == next->action) { >> + mod = next->ref_mod; >> + } else { >> + if (ref->ref_mod < next->ref_mod) { >> + swap(ref, next); >> + done = true; >> + } >> + mod = -next->ref_mod; >> + } >> + >> + drop_delayed_ref(trans, delayed_refs, head, next); >> + ref->ref_mod += mod; >> + if (ref->ref_mod == 0) { >> + drop_delayed_ref(trans, delayed_refs, head, ref); >> + done = true; >> + } else { >> + /* >> + * Can't have multiples of the same ref on a tree >> block. >> + */ >> + WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY || >> + ref->type == BTRFS_SHARED_BLOCK_REF_KEY); >> + } >> +next: >> + next = next2; >> + } >> + >> + return done; >> +} >> + >> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans, >> + struct btrfs_fs_info *fs_info, >> + struct btrfs_delayed_ref_root *delayed_refs, >> + struct btrfs_delayed_ref_head *head) >> +{ >> + struct btrfs_delayed_ref_node *ref; >> + u64 seq = 0; >> + >> + assert_spin_locked(&head->lock); >> + >> + if (list_empty(&head->ref_list)) >> + return; >> + >> + /* We don't have too many refs to merge for data. */ >> + if (head->is_data) >> + return; >> + >> + spin_lock(&fs_info->tree_mod_seq_lock); >> + if (!list_empty(&fs_info->tree_mod_seq_list)) { >> + struct seq_list *elem; >> + >> + elem = list_first_entry(&fs_info->tree_mod_seq_list, >> + struct seq_list, list); >> + seq = elem->seq; >> + } >> + spin_unlock(&fs_info->tree_mod_seq_lock); >> + >> + ref = list_first_entry(&head->ref_list, struct >> btrfs_delayed_ref_node, >> + list); >> + while (&ref->list != &head->ref_list) { >> + if (seq && ref->seq >= seq) >> + goto next; >> + >> + if (merge_ref(trans, delayed_refs, head, ref, seq)) { >> + if (list_empty(&head->ref_list)) >> + break; >> + ref = list_first_entry(&head->ref_list, >> + struct >> btrfs_delayed_ref_node, >> + list); >> + continue; >> + } >> +next: >> + ref = list_next_entry(ref, list); >> + } >> +} >> + >> int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, >> struct btrfs_delayed_ref_root *delayed_refs, >> u64 seq) >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 2df4bc7..dd5756c 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct >> btrfs_trans_handle *trans, >> } >> } >> >> + /* >> + * We need to try and merge add/drops of the same ref >> since we >> + * can run into issues with relocate dropping the implicit >> ref >> + * and then it being added back again before the drop can >> + * finish. If we merged anything we need to re-loop so we >> can >> + * get a good ref. >> + * Or we can get node references of the same type that >> weren't >> + * merged when created due to bumps in the tree mod seq, >> and >> + * we need to merge them to prevent adding an inline >> extent >> + * backref before dropping it (triggering a BUG_ON at >> + * insert_inline_extent_backref()). >> + */ >> spin_lock(&locked_ref->lock); >> + btrfs_merge_delayed_refs(trans, fs_info, delayed_refs, >> + locked_ref); >> >> /* >> * locked_ref is the head node, so we have to go one >> >