All of lore.kernel.org
 help / color / mirror / Atom feed
From: Keith Busch <kbusch@kernel.org>
To: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Kamaljit Singh <Kamaljit.Singh1@wdc.com>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	Damien Le Moal <dlemoal@kernel.org>
Subject: Re: parallel nvme format
Date: Thu, 14 Dec 2023 09:19:03 -0800	[thread overview]
Message-ID: <ZXs5B8pYKTBLD1NH@kbusch-mbp> (raw)
In-Reply-To: <ZXsJdJxDXEbCmSec@x1-carbon>

On Thu, Dec 14, 2023 at 01:56:05PM +0000, Niklas Cassel wrote:
> Hello guys,
> 
> We have observed the following on linux stable v6.6.x.
> 
> (We tried to backport commit 839a40d1e730 ("nvme: fix deadlock between reset
> and scan"), but it didn't fix the problem.)
> 
> 
> What happens is that we are formatting 8 different namespaces in parallel.
> 
> [Wed Nov 29 16:43:56 2023] INFO: task nvme:18411 blocked for more than 120 seconds.
> [Wed Nov 29 16:43:56 2023]       Not tainted 6.6.2 #1
> [Wed Nov 29 16:43:56 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> +disables this message.
> [Wed Nov 29 16:43:56 2023] task:nvme            state:D stack:0     pid:18411
> +ppid:18405  flags:0x00004000
> [Wed Nov 29 16:43:56 2023] Call Trace:
> [Wed Nov 29 16:43:56 2023]  <TASK>
> [Wed Nov 29 16:43:56 2023]  __schedule+0x34a/0xef0
> [Wed Nov 29 16:43:56 2023]  ? preempt_count_add+0x7c/0xc0
> [Wed Nov 29 16:43:56 2023]  ? _raw_spin_lock_irqsave+0x29/0x60
> [Wed Nov 29 16:43:56 2023]  schedule+0x6b/0xf0
> [Wed Nov 29 16:43:56 2023]  schedule_timeout+0x233/0x290
> [Wed Nov 29 16:43:56 2023]  ? __timer_delete_sync+0x31/0x50   
> [Wed Nov 29 16:43:56 2023]  ? schedule_timeout+0x1a3/0x290
> [Wed Nov 29 16:43:56 2023]  wait_for_completion+0x8b/0x140
> [Wed Nov 29 16:43:56 2023]  __flush_work+0x139/0x240
> [Wed Nov 29 16:43:56 2023]  ? __pfx_wq_barrier_func+0x10/0x10
> [Wed Nov 29 16:43:56 2023]  flush_work+0x10/0x20
> [Wed Nov 29 16:43:56 2023]  nvme_passthru_end+0x60/0x170 [nvme_core]
> [Wed Nov 29 16:43:56 2023]  nvme_submit_user_cmd+0x162/0x170 [nvme_core]
> [Wed Nov 29 16:43:56 2023]  nvme_user_cmd.constprop.0+0x156/0x1d0 [nvme_core]
> [Wed Nov 29 16:43:56 2023]  nvme_dev_ioctl+0x12c/0x2a0 [nvme_core]
> [Wed Nov 29 16:43:56 2023]  __x64_sys_ioctl+0x95/0xd0
> [Wed Nov 29 16:43:56 2023]  do_syscall_64+0x38/0x90
> [Wed Nov 29 16:43:56 2023]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> flush_work() waits for more that 120 seconds on scan_work getting completed,
> but even after 120 seconds, scan_work hasn't completed.
> 
> Inspecting nvme_scan_work(), the only thing that can cause it to not return
> after 120 seconds, is if it is waiting on mutex_lock() to get the scan mutex.
> 
> 
> What appears to be happening is:
> 
> format command #1 completes and calls nvme_passthru_end(),
> nvme_passthru_end() calls:
> 
> mutex_unlock(&ctrl->scan_lock);
> ...
> nvme_queue_scan(ctrl);
> flush_work(&ctrl->scan_work);
> 
> 
> Directly after format command #1 unlocks the mutex,
> format command #2 will get the mutex in nvme_passthru_start().
> 
> Format command #2 will release the mutex in nvme_passthru_end(),
> but only after the format command has completed.
> 
> The default timeout for the nvme format command via nvme-cli is 600 seconds,
> so it is easy to imagine that the format command will take longer than 120
> seconds.
> 
> 
> So the scan work at the end of format command #1 will eventually get to run,
> but only after the second nvme format command has completed.
> (nvme_queue_scan() will finally be able to get the mutex, once
> nvme_passthru_end() for the second format command has released the mutex.)
> 
> 
> I don't think that formatting multiple namespaces in parallel is an
> unreasonable thing to do, so I do not think that we should scare our
> users with "hung task timeouts".
> 
> 
> 
> My first though was that perhaps we could do something like:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 46a4c9c5ea96..811d40cf23f6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1113,6 +1113,7 @@ u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
>          * effects say only one namespace is affected.
>          */
>         if (effects & NVME_CMD_EFFECTS_CSE_MASK) {
> +               flush_work(&ctrl->scan_work);
>                 mutex_lock(&ctrl->scan_lock);
>                 mutex_lock(&ctrl->subsys->lock);
>                 nvme_mpath_start_freeze(ctrl->subsys);
> 
> 
> But that will not work, because it will still have a race window between the
> time nvme_passthru_end() has unlocked the mutex, and before
> nvme_passthru_end() has queued the scan work.
> 
> I guess, in the case that nvme_passthru_end() has to perform a scan, we want
> to unlock the scan mutex after nvme_passthru_end() has performed the scan,
> but right now nvme_queue_scan() will acquire the mutex (and is running on the
> workqueue)...
> 
> Thoughts?

Hi Niklas!

Thanks for the report and detailed analysis. What do you think about
this idea: if scan_lock is already locked, then some other thread is
expecting to requeue scan_work shortly, so no need to proceed with the
scan.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 590cd4f097c22..c6317489022c3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3945,7 +3945,13 @@ static void nvme_scan_work(struct work_struct *work)
 		nvme_clear_changed_ns_log(ctrl);
 	}
 
-	mutex_lock(&ctrl->scan_lock);
+	/*
+	 * Failure to aquire scan_lock means another thread will requeue this
+	 * work shortly, so abandon the current scan.
+	 */
+	if (!mutex_trylock(&ctrl->scan_lock))
+		return;
+
 	if (nvme_ctrl_limited_cns(ctrl)) {
 		nvme_scan_ns_sequential(ctrl);
 	} else {
--


  reply	other threads:[~2023-12-14 17:19 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-14 13:56 parallel nvme format Niklas Cassel
2023-12-14 17:19 ` Keith Busch [this message]
2023-12-14 21:21   ` Niklas Cassel
2023-12-14 22:49     ` Keith Busch
2023-12-14 23:08       ` Keith Busch
     [not found]         ` <BYAPR04MB4151B88FACA25A38C4C998FCBC8CA@BYAPR04MB4151.namprd04.prod.outlook.com>
2023-12-14 23:37           ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZXs5B8pYKTBLD1NH@kbusch-mbp \
    --to=kbusch@kernel.org \
    --cc=Kamaljit.Singh1@wdc.com \
    --cc=Niklas.Cassel@wdc.com \
    --cc=dlemoal@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.