stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: gregkh@linuxfoundation.org
Cc: fdmanana@suse.com, dsterba@suse.com, stable@vger.kernel.org
Subject: Re: FAILED: patch "[PATCH] Btrfs: fix crash during unmount due to race with delayed" failed to apply to 4.14-stable tree
Date: Wed, 15 Apr 2020 13:28:21 -0400	[thread overview]
Message-ID: <20200415172821.GI1068@sasha-vm> (raw)
In-Reply-To: <1586873688207222@kroah.com>

On Tue, Apr 14, 2020 at 04:14:48PM +0200, gregkh@linuxfoundation.org wrote:
>
>The patch below does not apply to the 4.14-stable tree.
>If someone wants it applied there, or to any other stable or longterm
>tree, then please email the backport, including the original git commit
>id to <stable@vger.kernel.org>.
>
>thanks,
>
>greg k-h
>
>------------------ original commit in Linus's tree ------------------
>
>From f0cc2cd70164efe8f75c5d99560f0f69969c72e4 Mon Sep 17 00:00:00 2001
>From: Filipe Manana <fdmanana@suse.com>
>Date: Fri, 28 Feb 2020 13:04:36 +0000
>Subject: [PATCH] Btrfs: fix crash during unmount due to race with delayed
> inode workers
>
>During unmount we can have a job from the delayed inode items work queue
>still running, that can lead to at least two bad things:
>
>1) A crash, because the worker can try to create a transaction just
>   after the fs roots were freed;
>
>2) A transaction leak, because the worker can create a transaction
>   before the fs roots are freed and just after we committed the last
>   transaction and after we stopped the transaction kthread.
>
>A stack trace example of the crash:
>
> [79011.691214] kernel BUG at lib/radix-tree.c:982!
> [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G        W         5.6.0-rc2-btrfs-next-54 #2
> (...)
> [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
> [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
> (...)
> [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
> [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
> [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
> [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
> [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
> [79011.709270] FS:  0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
> [79011.710699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
> [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [79011.715448] Call Trace:
> [79011.715925]  record_root_in_trans+0x72/0xf0 [btrfs]
> [79011.716819]  btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
> [79011.717925]  start_transaction+0xdd/0x5c0 [btrfs]
> [79011.718829]  btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
> [79011.719915]  btrfs_work_helper+0xaa/0x720 [btrfs]
> [79011.720773]  process_one_work+0x26d/0x6a0
> [79011.721497]  worker_thread+0x4f/0x3e0
> [79011.722153]  ? process_one_work+0x6a0/0x6a0
> [79011.722901]  kthread+0x103/0x140
> [79011.723481]  ? kthread_create_worker_on_cpu+0x70/0x70
> [79011.724379]  ret_from_fork+0x3a/0x50
> (...)
>
>The following diagram shows a sequence of steps that lead to the crash
>during ummount of the filesystem:
>
>        CPU 1                                             CPU 2                                CPU 3
>
> btrfs_punch_hole()
>   btrfs_btree_balance_dirty()
>     btrfs_balance_delayed_items()
>       --> sees
>           fs_info->delayed_root->items
>           with value 200, which is greater
>           than
>           BTRFS_DELAYED_BACKGROUND (128)
>           and smaller than
>           BTRFS_DELAYED_WRITEBACK (512)
>       btrfs_wq_run_delayed_node()
>         --> queues a job for
>             fs_info->delayed_workers to run
>             btrfs_async_run_delayed_root()
>
>                                                                                            btrfs_async_run_delayed_root()
>                                                                                              --> job queued by CPU 1
>
>                                                                                              --> starts picking and running
>                                                                                                  delayed nodes from the
>                                                                                                  prepare_list list
>
>                                                 close_ctree()
>
>                                                   btrfs_delete_unused_bgs()
>
>                                                   btrfs_commit_super()
>
>                                                     btrfs_join_transaction()
>                                                       --> gets transaction N
>
>                                                     btrfs_commit_transaction(N)
>                                                       --> set transaction state
>                                                        to TRANTS_STATE_COMMIT_START
>
>                                                                                             btrfs_first_prepared_delayed_node()
>                                                                                               --> picks delayed node X through
>                                                                                                   the prepared_list list
>
>                                                       btrfs_run_delayed_items()
>
>                                                         btrfs_first_delayed_node()
>                                                           --> also picks delayed node X
>                                                               but through the node_list
>                                                               list
>
>                                                         __btrfs_commit_inode_delayed_items()
>                                                            --> runs all delayed items from
>                                                                this node and drops the
>                                                                node's item count to 0
>                                                                through call to
>                                                                btrfs_release_delayed_inode()
>
>                                                         --> finishes running any remaining
>                                                             delayed nodes
>
>                                                       --> finishes transaction commit
>
>                                                   --> stops cleaner and transaction threads
>
>                                                   btrfs_free_fs_roots()
>                                                     --> frees all roots and removes them
>                                                         from the radix tree
>                                                         fs_info->fs_roots_radix
>
>                                                                                             btrfs_join_transaction()
>                                                                                               start_transaction()
>                                                                                                 btrfs_record_root_in_trans()
>                                                                                                   record_root_in_trans()
>                                                                                                     radix_tree_tag_set()
>                                                                                                       --> crashes because
>                                                                                                           the root is not in
>                                                                                                           the radix tree
>                                                                                                           anymore
>
>If the worker is able to call btrfs_join_transaction() before the unmount
>task frees the fs roots, we end up leaking a transaction and all its
>resources, since after the call to btrfs_commit_super() and stopping the
>transaction kthread, we don't expect to have any transaction open anymore.
>
>When this situation happens the worker has a delayed node that has no
>more items to run, since the task calling btrfs_run_delayed_items(),
>which is doing a transaction commit, picks the same node and runs all
>its items first.
>
>We can not wait for the worker to complete when running delayed items
>through btrfs_run_delayed_items(), because we call that function in
>several phases of a transaction commit, and that could cause a deadlock
>because the worker calls btrfs_join_transaction() and the task doing the
>transaction commit may have already set the transaction state to
>TRANS_STATE_COMMIT_DOING.
>
>Also it's not possible to get into a situation where only some of the
>items of a delayed node are added to the fs/subvolume tree in the current
>transaction and the remaining ones in the next transaction, because when
>running the items of a delayed inode we lock its mutex, effectively
>waiting for the worker if the worker is running the items of the delayed
>node already.
>
>Since this can only cause issues when unmounting a filesystem, fix it in
>a simple way by waiting for any jobs on the delayed workers queue before
>calling btrfs_commit_supper() at close_ctree(). This works because at this
>point no one can call btrfs_btree_balance_dirty() or
>btrfs_balance_delayed_items(), and if we end up waiting for any worker to
>complete, btrfs_commit_super() will commit the transaction created by the
>worker.
>
>CC: stable@vger.kernel.org # 4.4+
>Signed-off-by: Filipe Manana <fdmanana@suse.com>
>Reviewed-by: David Sterba <dsterba@suse.com>
>Signed-off-by: David Sterba <dsterba@suse.com>

There were context conflicts, such as with e1f60a6580c0 ("btrfs: add
__pure attribute to functions") going back. I've fixed those and queued
this patch.

-- 
Thanks,
Sasha

      reply	other threads:[~2020-04-15 17:28 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-14 14:14 FAILED: patch "[PATCH] Btrfs: fix crash during unmount due to race with delayed" failed to apply to 4.14-stable tree gregkh
2020-04-15 17:28 ` Sasha Levin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200415172821.GI1068@sasha-vm \
    --to=sashal@kernel.org \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).