From: Keith Busch <kbusch@kernel.org>
To: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
Kamaljit Singh <Kamaljit.Singh1@wdc.com>,
Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
Damien Le Moal <dlemoal@kernel.org>
Subject: Re: parallel nvme format
Date: Thu, 14 Dec 2023 09:19:03 -0800 [thread overview]
Message-ID: <ZXs5B8pYKTBLD1NH@kbusch-mbp> (raw)
In-Reply-To: <ZXsJdJxDXEbCmSec@x1-carbon>
On Thu, Dec 14, 2023 at 01:56:05PM +0000, Niklas Cassel wrote:
> Hello guys,
>
> We have observed the following on linux stable v6.6.x.
>
> (We tried to backport commit 839a40d1e730 ("nvme: fix deadlock between reset
> and scan"), but it didn't fix the problem.)
>
>
> What happens is that we are formatting 8 different namespaces in parallel.
>
> [Wed Nov 29 16:43:56 2023] INFO: task nvme:18411 blocked for more than 120 seconds.
> [Wed Nov 29 16:43:56 2023] Not tainted 6.6.2 #1
> [Wed Nov 29 16:43:56 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> +disables this message.
> [Wed Nov 29 16:43:56 2023] task:nvme state:D stack:0 pid:18411
> +ppid:18405 flags:0x00004000
> [Wed Nov 29 16:43:56 2023] Call Trace:
> [Wed Nov 29 16:43:56 2023] <TASK>
> [Wed Nov 29 16:43:56 2023] __schedule+0x34a/0xef0
> [Wed Nov 29 16:43:56 2023] ? preempt_count_add+0x7c/0xc0
> [Wed Nov 29 16:43:56 2023] ? _raw_spin_lock_irqsave+0x29/0x60
> [Wed Nov 29 16:43:56 2023] schedule+0x6b/0xf0
> [Wed Nov 29 16:43:56 2023] schedule_timeout+0x233/0x290
> [Wed Nov 29 16:43:56 2023] ? __timer_delete_sync+0x31/0x50
> [Wed Nov 29 16:43:56 2023] ? schedule_timeout+0x1a3/0x290
> [Wed Nov 29 16:43:56 2023] wait_for_completion+0x8b/0x140
> [Wed Nov 29 16:43:56 2023] __flush_work+0x139/0x240
> [Wed Nov 29 16:43:56 2023] ? __pfx_wq_barrier_func+0x10/0x10
> [Wed Nov 29 16:43:56 2023] flush_work+0x10/0x20
> [Wed Nov 29 16:43:56 2023] nvme_passthru_end+0x60/0x170 [nvme_core]
> [Wed Nov 29 16:43:56 2023] nvme_submit_user_cmd+0x162/0x170 [nvme_core]
> [Wed Nov 29 16:43:56 2023] nvme_user_cmd.constprop.0+0x156/0x1d0 [nvme_core]
> [Wed Nov 29 16:43:56 2023] nvme_dev_ioctl+0x12c/0x2a0 [nvme_core]
> [Wed Nov 29 16:43:56 2023] __x64_sys_ioctl+0x95/0xd0
> [Wed Nov 29 16:43:56 2023] do_syscall_64+0x38/0x90
> [Wed Nov 29 16:43:56 2023] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> flush_work() waits for more that 120 seconds on scan_work getting completed,
> but even after 120 seconds, scan_work hasn't completed.
>
> Inspecting nvme_scan_work(), the only thing that can cause it to not return
> after 120 seconds, is if it is waiting on mutex_lock() to get the scan mutex.
>
>
> What appears to be happening is:
>
> format command #1 completes and calls nvme_passthru_end(),
> nvme_passthru_end() calls:
>
> mutex_unlock(&ctrl->scan_lock);
> ...
> nvme_queue_scan(ctrl);
> flush_work(&ctrl->scan_work);
>
>
> Directly after format command #1 unlocks the mutex,
> format command #2 will get the mutex in nvme_passthru_start().
>
> Format command #2 will release the mutex in nvme_passthru_end(),
> but only after the format command has completed.
>
> The default timeout for the nvme format command via nvme-cli is 600 seconds,
> so it is easy to imagine that the format command will take longer than 120
> seconds.
>
>
> So the scan work at the end of format command #1 will eventually get to run,
> but only after the second nvme format command has completed.
> (nvme_queue_scan() will finally be able to get the mutex, once
> nvme_passthru_end() for the second format command has released the mutex.)
>
>
> I don't think that formatting multiple namespaces in parallel is an
> unreasonable thing to do, so I do not think that we should scare our
> users with "hung task timeouts".
>
>
>
> My first though was that perhaps we could do something like:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 46a4c9c5ea96..811d40cf23f6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1113,6 +1113,7 @@ u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
> * effects say only one namespace is affected.
> */
> if (effects & NVME_CMD_EFFECTS_CSE_MASK) {
> + flush_work(&ctrl->scan_work);
> mutex_lock(&ctrl->scan_lock);
> mutex_lock(&ctrl->subsys->lock);
> nvme_mpath_start_freeze(ctrl->subsys);
>
>
> But that will not work, because it will still have a race window between the
> time nvme_passthru_end() has unlocked the mutex, and before
> nvme_passthru_end() has queued the scan work.
>
> I guess, in the case that nvme_passthru_end() has to perform a scan, we want
> to unlock the scan mutex after nvme_passthru_end() has performed the scan,
> but right now nvme_queue_scan() will acquire the mutex (and is running on the
> workqueue)...
>
> Thoughts?
Hi Niklas!
Thanks for the report and detailed analysis. What do you think about
this idea: if scan_lock is already locked, then some other thread is
expecting to requeue scan_work shortly, so no need to proceed with the
scan.
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 590cd4f097c22..c6317489022c3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3945,7 +3945,13 @@ static void nvme_scan_work(struct work_struct *work)
nvme_clear_changed_ns_log(ctrl);
}
- mutex_lock(&ctrl->scan_lock);
+ /*
+ * Failure to aquire scan_lock means another thread will requeue this
+ * work shortly, so abandon the current scan.
+ */
+ if (!mutex_trylock(&ctrl->scan_lock))
+ return;
+
if (nvme_ctrl_limited_cns(ctrl)) {
nvme_scan_ns_sequential(ctrl);
} else {
--
next prev parent reply other threads:[~2023-12-14 17:19 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-14 13:56 parallel nvme format Niklas Cassel
2023-12-14 17:19 ` Keith Busch [this message]
2023-12-14 21:21 ` Niklas Cassel
2023-12-14 22:49 ` Keith Busch
2023-12-14 23:08 ` Keith Busch
[not found] ` <BYAPR04MB4151B88FACA25A38C4C998FCBC8CA@BYAPR04MB4151.namprd04.prod.outlook.com>
2023-12-14 23:37 ` Keith Busch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZXs5B8pYKTBLD1NH@kbusch-mbp \
--to=kbusch@kernel.org \
--cc=Kamaljit.Singh1@wdc.com \
--cc=Niklas.Cassel@wdc.com \
--cc=dlemoal@kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=shinichiro.kawasaki@wdc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox