From: Sasha Levin <sashal@kernel.org>
To: gregkh@linuxfoundation.org
Cc: fdmanana@suse.com, dsterba@suse.com, stable@vger.kernel.org
Subject: Re: FAILED: patch "[PATCH] Btrfs: fix crash during unmount due to race with delayed" failed to apply to 4.14-stable tree
Date: Wed, 15 Apr 2020 13:28:21 -0400 [thread overview]
Message-ID: <20200415172821.GI1068@sasha-vm> (raw)
In-Reply-To: <1586873688207222@kroah.com>
On Tue, Apr 14, 2020 at 04:14:48PM +0200, gregkh@linuxfoundation.org wrote:
>
>The patch below does not apply to the 4.14-stable tree.
>If someone wants it applied there, or to any other stable or longterm
>tree, then please email the backport, including the original git commit
>id to <stable@vger.kernel.org>.
>
>thanks,
>
>greg k-h
>
>------------------ original commit in Linus's tree ------------------
>
>From f0cc2cd70164efe8f75c5d99560f0f69969c72e4 Mon Sep 17 00:00:00 2001
>From: Filipe Manana <fdmanana@suse.com>
>Date: Fri, 28 Feb 2020 13:04:36 +0000
>Subject: [PATCH] Btrfs: fix crash during unmount due to race with delayed
> inode workers
>
>During unmount we can have a job from the delayed inode items work queue
>still running, that can lead to at least two bad things:
>
>1) A crash, because the worker can try to create a transaction just
> after the fs roots were freed;
>
>2) A transaction leak, because the worker can create a transaction
> before the fs roots are freed and just after we committed the last
> transaction and after we stopped the transaction kthread.
>
>A stack trace example of the crash:
>
> [79011.691214] kernel BUG at lib/radix-tree.c:982!
> [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2
> (...)
> [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
> [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
> (...)
> [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
> [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
> [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
> [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
> [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
> [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
> [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
> [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [79011.715448] Call Trace:
> [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs]
> [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
> [79011.717925] start_transaction+0xdd/0x5c0 [btrfs]
> [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
> [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs]
> [79011.720773] process_one_work+0x26d/0x6a0
> [79011.721497] worker_thread+0x4f/0x3e0
> [79011.722153] ? process_one_work+0x6a0/0x6a0
> [79011.722901] kthread+0x103/0x140
> [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70
> [79011.724379] ret_from_fork+0x3a/0x50
> (...)
>
>The following diagram shows a sequence of steps that lead to the crash
>during ummount of the filesystem:
>
> CPU 1 CPU 2 CPU 3
>
> btrfs_punch_hole()
> btrfs_btree_balance_dirty()
> btrfs_balance_delayed_items()
> --> sees
> fs_info->delayed_root->items
> with value 200, which is greater
> than
> BTRFS_DELAYED_BACKGROUND (128)
> and smaller than
> BTRFS_DELAYED_WRITEBACK (512)
> btrfs_wq_run_delayed_node()
> --> queues a job for
> fs_info->delayed_workers to run
> btrfs_async_run_delayed_root()
>
> btrfs_async_run_delayed_root()
> --> job queued by CPU 1
>
> --> starts picking and running
> delayed nodes from the
> prepare_list list
>
> close_ctree()
>
> btrfs_delete_unused_bgs()
>
> btrfs_commit_super()
>
> btrfs_join_transaction()
> --> gets transaction N
>
> btrfs_commit_transaction(N)
> --> set transaction state
> to TRANTS_STATE_COMMIT_START
>
> btrfs_first_prepared_delayed_node()
> --> picks delayed node X through
> the prepared_list list
>
> btrfs_run_delayed_items()
>
> btrfs_first_delayed_node()
> --> also picks delayed node X
> but through the node_list
> list
>
> __btrfs_commit_inode_delayed_items()
> --> runs all delayed items from
> this node and drops the
> node's item count to 0
> through call to
> btrfs_release_delayed_inode()
>
> --> finishes running any remaining
> delayed nodes
>
> --> finishes transaction commit
>
> --> stops cleaner and transaction threads
>
> btrfs_free_fs_roots()
> --> frees all roots and removes them
> from the radix tree
> fs_info->fs_roots_radix
>
> btrfs_join_transaction()
> start_transaction()
> btrfs_record_root_in_trans()
> record_root_in_trans()
> radix_tree_tag_set()
> --> crashes because
> the root is not in
> the radix tree
> anymore
>
>If the worker is able to call btrfs_join_transaction() before the unmount
>task frees the fs roots, we end up leaking a transaction and all its
>resources, since after the call to btrfs_commit_super() and stopping the
>transaction kthread, we don't expect to have any transaction open anymore.
>
>When this situation happens the worker has a delayed node that has no
>more items to run, since the task calling btrfs_run_delayed_items(),
>which is doing a transaction commit, picks the same node and runs all
>its items first.
>
>We can not wait for the worker to complete when running delayed items
>through btrfs_run_delayed_items(), because we call that function in
>several phases of a transaction commit, and that could cause a deadlock
>because the worker calls btrfs_join_transaction() and the task doing the
>transaction commit may have already set the transaction state to
>TRANS_STATE_COMMIT_DOING.
>
>Also it's not possible to get into a situation where only some of the
>items of a delayed node are added to the fs/subvolume tree in the current
>transaction and the remaining ones in the next transaction, because when
>running the items of a delayed inode we lock its mutex, effectively
>waiting for the worker if the worker is running the items of the delayed
>node already.
>
>Since this can only cause issues when unmounting a filesystem, fix it in
>a simple way by waiting for any jobs on the delayed workers queue before
>calling btrfs_commit_supper() at close_ctree(). This works because at this
>point no one can call btrfs_btree_balance_dirty() or
>btrfs_balance_delayed_items(), and if we end up waiting for any worker to
>complete, btrfs_commit_super() will commit the transaction created by the
>worker.
>
>CC: stable@vger.kernel.org # 4.4+
>Signed-off-by: Filipe Manana <fdmanana@suse.com>
>Reviewed-by: David Sterba <dsterba@suse.com>
>Signed-off-by: David Sterba <dsterba@suse.com>
There were context conflicts, such as with e1f60a6580c0 ("btrfs: add
__pure attribute to functions") going back. I've fixed those and queued
this patch.
--
Thanks,
Sasha
prev parent reply other threads:[~2020-04-15 17:28 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-14 14:14 FAILED: patch "[PATCH] Btrfs: fix crash during unmount due to race with delayed" failed to apply to 4.14-stable tree gregkh
2020-04-15 17:28 ` Sasha Levin [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200415172821.GI1068@sasha-vm \
--to=sashal@kernel.org \
--cc=dsterba@suse.com \
--cc=fdmanana@suse.com \
--cc=gregkh@linuxfoundation.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.