From: Daniel Thompson <daniel.thompson@linaro.org>
To: liu.yec@h3c.com
Cc: jirislaby@kernel.org, dianders@chromium.org,
gregkh@linuxfoundation.org, jason.wessel@windriver.com,
kgdb-bugreport@lists.sourceforge.net,
linux-kernel@vger.kernel.org, linux-serial@vger.kernel.org
Subject: Re: [PATCH V5] kdb: Fix the deadlock issue in KDB debugging.
Date: Mon, 25 Mar 2024 16:54:36 +0000
Message-ID: <20240325165436.GA485978@aspen.lan>
In-Reply-To: <20240323014141.3621738-1-liu.yec@h3c.com>
On Sat, Mar 23, 2024 at 09:41:41AM +0800, liu.yec@h3c.com wrote:
> From: LiuYe <liu.yeC@h3c.com>
>
> Currently, if CONFIG_KDB_KEYBOARD is enabled, then kgdboc will
> attempt to use schedule_work() to provoke a keyboard reset when
> transitioning out of the debugger and back to normal operation.
> This can cause deadlock because schedule_work() is not NMI-safe.
>
> The stack trace below shows an example of the problem. In this
> case the master cpu is not running from NMI but it has parked
> the slave CPUs using an NMI and the parked CPUs is holding
> spinlocks needed by schedule_work().
>
> example:
> BUG: spinlock lockup suspected on CPU#0, namex/10450
> lock: 0xffff881ffe823980, .magic: dead4ead, .owner: namexx/21888, .owner_cpu: 1
> ffff881741d00000 ffff881741c01000 0000000000000000 0000000000000000
> ffff881740f58e78 ffff881741cffdd0 ffffffff8147a7fc ffff881740f58f20
> Call Trace:
> [<ffffffff81479e6d>] ? __schedule+0x16d/0xac0
> [<ffffffff8147a7fc>] ? schedule+0x3c/0x90
> [<ffffffff8147e71a>] ? schedule_hrtimeout_range_clock+0x10a/0x120
> [<ffffffff8147d22e>] ? mutex_unlock+0xe/0x10
> [<ffffffff811c839b>] ? ep_scan_ready_list+0x1db/0x1e0
> [<ffffffff8147e743>] ? schedule_hrtimeout_range+0x13/0x20
> [<ffffffff811c864a>] ? ep_poll+0x27a/0x3b0
> [<ffffffff8108c540>] ? wake_up_q+0x70/0x70
> [<ffffffff811c99a8>] ? SyS_epoll_wait+0xb8/0xd0
> [<ffffffff8147f296>] ? entry_SYSCALL_64_fastpath+0x12/0x75
> CPU: 0 PID: 10450 Comm: namex Tainted: G O 4.4.65 #1
> Hardware name: Insyde Purley/Type2 - Board Product Name1, BIOS 05.21.51.0036 07/19/2019
> 0000000000000000 ffff881ffe813c10 ffffffff8124e883 ffff881741c01000
> ffff881ffe823980 ffff881ffe813c38 ffffffff810a7f7f ffff881ffe823980
> 000000007d2b7cd0 0000000000000001 ffff881ffe813c68 ffffffff810a80e0
> Call Trace:
> <#DB> [<ffffffff8124e883>] dump_stack+0x85/0xc2
> [<ffffffff810a7f7f>] spin_dump+0x7f/0x100
> [<ffffffff810a80e0>] do_raw_spin_lock+0xa0/0x150
> [<ffffffff8147eb55>] _raw_spin_lock+0x15/0x20
> [<ffffffff8108c256>] try_to_wake_up+0x176/0x3d0
> [<ffffffff8108c4c5>] wake_up_process+0x15/0x20
> [<ffffffff8107b371>] insert_work+0x81/0xc0
> [<ffffffff8107b4e5>] __queue_work+0x135/0x390
> [<ffffffff8107b786>] queue_work_on+0x46/0x90
> [<ffffffff81313d28>] kgdboc_post_exp_handler+0x48/0x70
> [<ffffffff810ed488>] kgdb_cpu_enter+0x598/0x610
> [<ffffffff810ed6e2>] kgdb_handle_exception+0xf2/0x1f0
> [<ffffffff81054e21>] __kgdb_notify+0x71/0xd0
> [<ffffffff81054eb5>] kgdb_notify+0x35/0x70
> [<ffffffff81082e6a>] notifier_call_chain+0x4a/0x70
> [<ffffffff8108304d>] notify_die+0x3d/0x50
> [<ffffffff81017219>] do_int3+0x89/0x120
> [<ffffffff81480fb4>] int3+0x44/0x80
>
> Just need to postpone schedule_work to the slave CPU exiting the NMI context.
>
> irq_work will only respond to handle schedule_work after exiting the current interrupt context.
>
> When the master CPU exits the interrupt context, other CPUs will naturally exit the NMI context, so there will be no deadlock.
>
> It is the call to input_register_handler() that forces us not to do the work from irq_work's hardirq callback.
>
> Therefore schedule another work in the irq_work and not do the job directly.

This looks like it was copied and pasted from the e-mail thread without
any editing to make it make sense. It is not even formatted correctly
(where are the line breaks?).

How about:

We fix the problem by using irq_work to call schedule_work()
instead of calling it directly. irq_work is an NMI-safe deferred work
framework that performs the requested work from a hardirq context
(usually an IPI, but it can be a timer interrupt on some
architectures).

Note that we still need a workqueue since we cannot resync
the keyboard state from the hardirq context provided by irq_work.
That must be done from task context because it calls into the input
subsystem. Hence we must defer the work twice: first to safely
switch from the debug trap (NMI-like context) to hardirq, and
then to get from hardirq to the system workqueue.
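
To make the double deferral concrete, here is a minimal userspace model of
it. This is not kernel code and every name in it is hypothetical: atomic
flags stand in for a queued irq_work and a queued work_struct, and calling
the stage functions in order stands in for the IPI and the workqueue thread
actually running. The point it shows is that the debug-trap stage does
nothing but an atomic store, and the keyboard resync only happens in the
last (task context) stage.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model state, not kernel API. */
static atomic_bool irq_work_pending; /* models a queued irq_work */
static atomic_bool work_pending;     /* models a queued work_struct */
static int keyboard_resyncs;         /* counts task-context resyncs */

/*
 * Stage 0: models the post-exception hook running from the debug trap.
 * It only sets an atomic flag, which cannot deadlock even if the parked
 * CPUs are holding scheduler or workqueue locks.
 */
static void model_post_exp_handler(void)
{
	atomic_store(&irq_work_pending, true);
}

/*
 * Stage 1: models the irq_work callback in hardirq context. Queueing
 * work is fine here, but calling into the input subsystem is not, so
 * it just hands off to the next stage.
 */
static void model_run_irq_work(void)
{
	if (atomic_exchange(&irq_work_pending, false))
		atomic_store(&work_pending, true);
}

/*
 * Stage 2: models the system workqueue in task context, where the
 * keyboard state can safely be resynced.
 */
static void model_run_workqueue(void)
{
	if (atomic_exchange(&work_pending, false))
		keyboard_resyncs++;
}
```

Because each stage consumes its pending flag with an exchange, several
debugger exits before the hardirq stage runs collapse into a single
resync, much as a second irq_work_queue() on an already pending irq_work
does not queue it twice.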

Daniel.

Thread overview: 48+ messages
2024-02-28 2:56 [PATCH] kdb: Fix the deadlock issue in KDB debugging LiuYe
2024-02-28 12:05 ` Daniel Thompson
2024-03-01 3:30 ` Re: " Liuye
2024-03-01 10:59 ` Daniel Thompson
2024-03-12 8:37 ` Re: " Liuye
2024-03-12 9:57 ` Daniel Thompson
2024-03-12 10:04 ` Re: " Liuye
2024-03-12 10:24 ` Daniel Thompson
2024-03-13 1:22 ` Re: " Liuye
2024-03-13 14:17 ` Daniel Thompson
2024-03-14 7:06 ` Re: " Liuye
2024-03-14 13:09 ` Daniel Thompson
2024-03-15 9:59 ` Re: " Liuye
2024-03-16 2:34 ` [PATCH v1] " liu.yec
2024-03-20 16:28 ` Daniel Thompson
2024-03-21 2:26 ` [PATCH V3] " liu.yec
2024-03-21 7:38 ` Greg KH
2024-03-21 7:57 ` Re: " Liuye
2024-03-21 11:04 ` Daniel Thompson
2024-03-21 11:50 ` [PATCH V4] " liu.yec
2024-03-22 6:54 ` Jiri Slaby
2024-03-22 7:50 ` Re: " Liuye
2024-03-22 15:58 ` Daniel Thompson
2024-03-23 1:41 ` [PATCH V5] " liu.yec
2024-03-25 16:54 ` Daniel Thompson [this message]
2024-03-26 0:47 ` Re: " Liuye
2024-03-26 7:40 ` [PATCH V6] " liu.yec
2024-03-26 8:22 ` Greg KH
2024-03-26 8:54 ` [PATCH V7] " liu.yec
2024-04-02 12:58 ` Daniel Thompson
2024-04-03 6:11 ` [PATCH V8] " liu.yec
2024-04-03 13:58 ` Daniel Thompson
2024-04-03 22:22 ` Andy Shevchenko
2024-04-08 1:44 ` LiuYe
2024-04-08 10:29 ` Andy Shevchenko
2024-04-09 2:03 ` [PATCH V9] " liu.yec
2024-04-10 2:06 ` [PATCH V10] " liu.yec
2024-04-10 3:59 ` Andy Shevchenko
2024-04-10 5:30 ` Greg KH
2024-04-10 5:54 ` Re: " Liuye
2024-04-10 5:59 ` Greg KH
2024-04-10 6:10 ` Re: " Liuye
2024-04-10 6:15 ` Greg KH
2024-04-10 6:30 ` Re: " Liuye
2024-04-10 7:18 ` [PATCH V11] " liu.yec
2024-04-10 8:24 ` Re: Re: Re: [PATCH V10] " Greg KH
2024-04-10 8:38 ` Re: " Liuye
2024-03-02 20:44 ` [PATCH] " Greg KH