From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.6 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86FDBC2BA19 for ; Wed, 15 Apr 2020 17:28:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 574E220784 for ; Wed, 15 Apr 2020 17:28:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586971716; bh=tF5kNUmcsYIYPwMIs1oH29T64Yiu1Tq0ewpvaaI0Lrc=; h=Date:From:To:Cc:Subject:References:In-Reply-To:List-ID:From; b=TeGArf5dBMyqndqhQKKzuKHY8e+rffRQE9K0j+7jhCJaijzfVNq9t4IYx/AORhDyN nPg20GpDqfWREc2V3vnsC2MSHlsvVF/B5aSnvUAUYYByXbBG/j+C4FA4xrF9cF6BrM uopvf63sP1zB6vDtIdRhSm+BpNVAafoSmW8hrATI= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2411065AbgDOR2d (ORCPT ); Wed, 15 Apr 2020 13:28:33 -0400 Received: from mail.kernel.org ([198.145.29.99]:43146 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2411064AbgDOR2X (ORCPT ); Wed, 15 Apr 2020 13:28:23 -0400 Received: from localhost (c-73-47-72-35.hsd1.nh.comcast.net [73.47.72.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7091820784; Wed, 15 Apr 2020 17:28:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586971702; bh=tF5kNUmcsYIYPwMIs1oH29T64Yiu1Tq0ewpvaaI0Lrc=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=NzD0KtUZVVWGCdyFGHHFoHibAZiWmGdLQfauuJQJc3qL1oz65WgVa4cJRCDnmzgZc L67vHndDBDX+a0nzfse9VIt63mLwrwN1LfAQo8U0UQqgCh0m4YYU0sk3OkQwBRKWjY uifooEghGgCKwOggLz365gVA3I9TSWcjWuWYGmM4= Date: Wed, 15 Apr 2020 13:28:21 -0400 From: Sasha Levin To: gregkh@linuxfoundation.org Cc: fdmanana@suse.com, dsterba@suse.com, stable@vger.kernel.org Subject: Re: FAILED: patch "[PATCH] Btrfs: fix crash during unmount due to race with delayed" failed to apply to 4.14-stable tree Message-ID: <20200415172821.GI1068@sasha-vm> References: <1586873688207222@kroah.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: <1586873688207222@kroah.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org On Tue, Apr 14, 2020 at 04:14:48PM +0200, gregkh@linuxfoundation.org wrote: > >The patch below does not apply to the 4.14-stable tree. >If someone wants it applied there, or to any other stable or longterm >tree, then please email the backport, including the original git commit >id to . > >thanks, > >greg k-h > >------------------ original commit in Linus's tree ------------------ > >>From f0cc2cd70164efe8f75c5d99560f0f69969c72e4 Mon Sep 17 00:00:00 2001 >From: Filipe Manana >Date: Fri, 28 Feb 2020 13:04:36 +0000 >Subject: [PATCH] Btrfs: fix crash during unmount due to race with delayed > inode workers > >During unmount we can have a job from the delayed inode items work queue >still running, that can lead to at least two bad things: > >1) A crash, because the worker can try to create a transaction just > after the fs roots were freed; > >2) A transaction leak, because the worker can create a transaction > before the fs roots are freed and just after we committed the last > transaction and after we stopped the transaction kthread. > >A stack trace example of the crash: > > [79011.691214] kernel BUG at lib/radix-tree.c:982! > [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI > [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2 > (...) > [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs] > [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170 > (...) > [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293 > [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0 > [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0 > [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000 > [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001 > [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000 > [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0 > [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [79011.715448] Call Trace: > [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs] > [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs] > [79011.717925] start_transaction+0xdd/0x5c0 [btrfs] > [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs] > [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs] > [79011.720773] process_one_work+0x26d/0x6a0 > [79011.721497] worker_thread+0x4f/0x3e0 > [79011.722153] ? process_one_work+0x6a0/0x6a0 > [79011.722901] kthread+0x103/0x140 > [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70 > [79011.724379] ret_from_fork+0x3a/0x50 > (...) > >The following diagram shows a sequence of steps that lead to the crash >during ummount of the filesystem: > > CPU 1 CPU 2 CPU 3 > > btrfs_punch_hole() > btrfs_btree_balance_dirty() > btrfs_balance_delayed_items() > --> sees > fs_info->delayed_root->items > with value 200, which is greater > than > BTRFS_DELAYED_BACKGROUND (128) > and smaller than > BTRFS_DELAYED_WRITEBACK (512) > btrfs_wq_run_delayed_node() > --> queues a job for > fs_info->delayed_workers to run > btrfs_async_run_delayed_root() > > btrfs_async_run_delayed_root() > --> job queued by CPU 1 > > --> starts picking and running > delayed nodes from the > prepare_list list > > close_ctree() > > btrfs_delete_unused_bgs() > > btrfs_commit_super() > > btrfs_join_transaction() > --> gets transaction N > > btrfs_commit_transaction(N) > --> set transaction state > to TRANTS_STATE_COMMIT_START > > btrfs_first_prepared_delayed_node() > --> picks delayed node X through > the prepared_list list > > btrfs_run_delayed_items() > > btrfs_first_delayed_node() > --> also picks delayed node X > but through the node_list > list > > __btrfs_commit_inode_delayed_items() > --> runs all delayed items from > this node and drops the > node's item count to 0 > through call to > btrfs_release_delayed_inode() > > --> finishes running any remaining > delayed nodes > > --> finishes transaction commit > > --> stops cleaner and transaction threads > > btrfs_free_fs_roots() > --> frees all roots and removes them > from the radix tree > fs_info->fs_roots_radix > > btrfs_join_transaction() > start_transaction() > btrfs_record_root_in_trans() > record_root_in_trans() > radix_tree_tag_set() > --> crashes because > the root is not in > the radix tree > anymore > >If the worker is able to call btrfs_join_transaction() before the unmount >task frees the fs roots, we end up leaking a transaction and all its >resources, since after the call to btrfs_commit_super() and stopping the >transaction kthread, we don't expect to have any transaction open anymore. > >When this situation happens the worker has a delayed node that has no >more items to run, since the task calling btrfs_run_delayed_items(), >which is doing a transaction commit, picks the same node and runs all >its items first. > >We can not wait for the worker to complete when running delayed items >through btrfs_run_delayed_items(), because we call that function in >several phases of a transaction commit, and that could cause a deadlock >because the worker calls btrfs_join_transaction() and the task doing the >transaction commit may have already set the transaction state to >TRANS_STATE_COMMIT_DOING. > >Also it's not possible to get into a situation where only some of the >items of a delayed node are added to the fs/subvolume tree in the current >transaction and the remaining ones in the next transaction, because when >running the items of a delayed inode we lock its mutex, effectively >waiting for the worker if the worker is running the items of the delayed >node already. > >Since this can only cause issues when unmounting a filesystem, fix it in >a simple way by waiting for any jobs on the delayed workers queue before >calling btrfs_commit_supper() at close_ctree(). This works because at this >point no one can call btrfs_btree_balance_dirty() or >btrfs_balance_delayed_items(), and if we end up waiting for any worker to >complete, btrfs_commit_super() will commit the transaction created by the >worker. > >CC: stable@vger.kernel.org # 4.4+ >Signed-off-by: Filipe Manana >Reviewed-by: David Sterba >Signed-off-by: David Sterba There were context conflicts, such as with e1f60a6580c0 ("btrfs: add __pure attribute to functions") going back. I've fixed those and queued this patch. -- Thanks, Sasha