From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Bitao Hu <yaoma@linux.alibaba.com>,
Guixin Liu <kanie@linux.alibaba.com>,
Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>,
Keith Busch <kbusch@kernel.org>, Sasha Levin <sashal@kernel.org>,
linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 6.6 34/47] nvme: fix deadlock between reset and scan
Date: Mon, 11 Dec 2023 08:50:35 -0500 [thread overview]
Message-ID: <20231211135147.380223-34-sashal@kernel.org> (raw)
In-Reply-To: <20231211135147.380223-1-sashal@kernel.org>
From: Bitao Hu <yaoma@linux.alibaba.com>
[ Upstream commit 839a40d1e730977d4448d141fa653517c2959a88 ]
If controller reset occurs when allocating namespace, both
nvme_reset_work and nvme_scan_work will hang, as shown below.
Test Scripts:
for ((t=1;t<=128;t++))
do
nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \
-d 0 | awk -F: '{print($NF);}'`
nvme attach-ns /dev/nvme1 -n $nsid -c 0
done
nvme reset /dev/nvme1
We will find that both nvme_reset_work and nvme_scan_work hung:
INFO: task kworker/u249:4:17848 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:kworker/u249:4 state:D stack: 0 pid:17848 ppid: 2
flags:0x00000028
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
Call trace:
__switch_to+0xb4/0xfc
__schedule+0x22c/0x670
schedule+0x4c/0xd0
blk_mq_freeze_queue_wait+0x84/0xc0
nvme_wait_freeze+0x40/0x64 [nvme_core]
nvme_reset_work+0x1c0/0x5cc [nvme]
process_one_work+0x1d8/0x4b0
worker_thread+0x230/0x440
kthread+0x114/0x120
INFO: task kworker/u249:3:22404 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:kworker/u249:3 state:D stack: 0 pid:22404 ppid: 2
flags:0x00000028
Workqueue: nvme-wq nvme_scan_work [nvme_core]
Call trace:
__switch_to+0xb4/0xfc
__schedule+0x22c/0x670
schedule+0x4c/0xd0
rwsem_down_write_slowpath+0x32c/0x98c
down_write+0x70/0x80
nvme_alloc_ns+0x1ac/0x38c [nvme_core]
nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core]
nvme_scan_ns_list+0xe8/0x2e4 [nvme_core]
nvme_scan_work+0x60/0x500 [nvme_core]
process_one_work+0x1d8/0x4b0
worker_thread+0x260/0x440
kthread+0x114/0x120
INFO: task nvme:28428 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:nvme state:D stack: 0 pid:28428 ppid: 27119
flags:0x00000000
Call trace:
__switch_to+0xb4/0xfc
__schedule+0x22c/0x670
schedule+0x4c/0xd0
schedule_timeout+0x160/0x194
do_wait_for_common+0xac/0x1d0
__wait_for_common+0x78/0x100
wait_for_completion+0x24/0x30
__flush_work.isra.0+0x74/0x90
flush_work+0x14/0x20
nvme_reset_ctrl_sync+0x50/0x74 [nvme_core]
nvme_dev_ioctl+0x1b0/0x250 [nvme_core]
__arm64_sys_ioctl+0xa8/0xf0
el0_svc_common+0x88/0x234
do_el0_svc+0x7c/0x90
el0_svc+0x1c/0x30
el0_sync_handler+0xa8/0xb0
el0_sync+0x148/0x180
The reason for the hang is that nvme_reset_work occurs while nvme_scan_work
is still running. nvme_scan_work may add new ns into ctrl->namespaces
list after nvme_reset_work frozen all ns->q in ctrl->namespaces list.
The newly added ns is not frozen, so nvme_wait_freeze will wait forever.
Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so
nvme_scan_work will also wait forever. Now we are deadlocked!
PROCESS1 PROCESS2
============== ==============
nvme_scan_work
... nvme_reset_work
nvme_validate_or_alloc_ns nvme_dev_disable
nvme_alloc_ns nvme_start_freeze
down_write ...
nvme_ns_add_to_ctrl_list ...
up_write nvme_wait_freeze
... down_read
nvme_alloc_ns blk_mq_freeze_queue_wait
down_write
Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in
nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check
it before adding the new namespace (under the namespaces_rwsem).
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/nvme/host/core.c | 10 ++++++++++
drivers/nvme/host/nvme.h | 1 +
2 files changed, 11 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4e7bda7749caa..d5bfb6db5a1cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3641,6 +3641,14 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
goto out_unlink_ns;
down_write(&ctrl->namespaces_rwsem);
+ /*
+ * Ensure that no namespaces are added to the ctrl list after the queues
+ * are frozen, thereby avoiding a deadlock between scan and reset.
+ */
+ if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
+ up_write(&ctrl->namespaces_rwsem);
+ goto out_unlink_ns;
+ }
nvme_ns_add_to_ctrl_list(ns);
up_write(&ctrl->namespaces_rwsem);
nvme_get_ctrl(ctrl);
@@ -4550,6 +4558,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
list_for_each_entry(ns, &ctrl->namespaces, list)
blk_mq_unfreeze_queue(ns->queue);
up_read(&ctrl->namespaces_rwsem);
+ clear_bit(NVME_CTRL_FROZEN, &ctrl->flags);
}
EXPORT_SYMBOL_GPL(nvme_unfreeze);
@@ -4583,6 +4592,7 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ set_bit(NVME_CTRL_FROZEN, &ctrl->flags);
down_read(&ctrl->namespaces_rwsem);
list_for_each_entry(ns, &ctrl->namespaces, list)
blk_freeze_queue_start(ns->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 68313148e6ed5..bc6d6eabab66b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -251,6 +251,7 @@ enum nvme_ctrl_flags {
NVME_CTRL_STOPPED = 3,
NVME_CTRL_SKIP_ID_CNS_CS = 4,
NVME_CTRL_DIRTY_CAPABILITY = 5,
+ NVME_CTRL_FROZEN = 6,
};
struct nvme_ctrl {
--
2.42.0
next prev parent reply other threads:[~2023-12-11 13:54 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-11 13:50 [PATCH AUTOSEL 6.6 01/47] hwtracing: hisi_ptt: Handle the interrupt in hardirq context Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 02/47] hwtracing: hisi_ptt: Don't try to attach a task Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 03/47] ASoC: amd: yc: Add HP 255 G10 into quirk table Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 04/47] ASoC: wm8974: Correct boost mixer inputs Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 05/47] arm64: dts: rockchip: fix rk356x pcie msg interrupt name Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 06/47] ASoC: Intel: Skylake: Fix mem leak in few functions Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 07/47] ASoC: nau8822: Fix incorrect type in assignment and cast to restricted __be16 Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 08/47] ASoC: SOF: topology: Fix mem leak in sof_dai_load() Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 09/47] ASoC: Intel: Skylake: mem leak in skl register function Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 10/47] ASoC: cs43130: Fix the position of const qualifier Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 11/47] ASoC: cs43130: Fix incorrect frame delay configuration Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 12/47] ASoC: fsl_xcvr: Enable 2 * TX bit clock for spdif only case Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 13/47] ASoC: rt5650: add mutex to avoid the jack detection failure Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 14/47] ASoC: SOF: mediatek: mt8186: Add Google Steelix topology compatible Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 15/47] ASoC: fsl_xcvr: refine the requested phy clock frequency Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 16/47] ASoC: Intel: skl_hda_dsp_generic: Drop HDMI routes when HDMI is not available Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 17/47] ASoC: SOF: ipc4-topology: Add core_mask in struct snd_sof_pipeline Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 18/47] ASoC: SOF: sof-audio: Modify logic for enabling/disabling topology cores Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 19/47] nouveau/tu102: flush all pdbs on vmm flush Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 20/47] ASoC: amd: yc: Add DMI entry to support System76 Pangolin 13 Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 21/47] ASoC: hdac_hda: Conditionally register dais for HDMI and Analog Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 22/47] ASoC: SOF: ipc4-topology: Correct data structures for the SRC module Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 23/47] ASoC: SOF: ipc4-topology: Correct data structures for the GAIN module Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 24/47] pds_vdpa: fix up format-truncation complaint Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 25/47] pds_vdpa: clear config callback when status goes to 0 Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 26/47] pds_vdpa: set features order Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 27/47] net/tg3: fix race condition in tg3_reset_task() Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 28/47] ASoC: da7219: Support low DC impedance headset Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 29/47] ASoC: ops: add correct range check for limiting volume Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 30/47] nvme: introduce helper function to get ctrl state Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 31/47] nvme: ensure reset state check ordering Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 32/47] nvme-ioctl: move capable() admin check to the end Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 33/47] nvme: prevent potential spectre v1 gadget Sasha Levin
2023-12-11 13:50 ` Sasha Levin [this message]
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 35/47] arm64: dts: rockchip: Fix PCI node addresses on rk3399-gru Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 36/47] mips/smp: Call rcutree_report_cpu_starting() earlier Sasha Levin
2023-12-12 1:45 ` Huacai Chen
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 37/47] r8152: add vendor/device ID pair for ASUS USB-C2500 Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 38/47] drm/amd/display: Use channel_width = 2 for vram table 3.0 Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 39/47] drm/amd/display: Add monitor patch for specific eDP Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 40/47] drm/amdgpu: Add NULL checks for function pointers Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 41/47] drm/exynos: fix a potential error pointer dereference Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 42/47] drm/exynos: fix a wrong error checking Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 43/47] ALSA: pcmtest: stop timer before buffer is released Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 44/47] hwmon: (corsair-psu) Fix probe when built-in Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 45/47] LoongArch: Apply dynamic relocations for LLD Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 46/47] LoongArch: Set unwind stack type to unknown rather than set error flag Sasha Levin
2023-12-11 13:50 ` [PATCH AUTOSEL 6.6 47/47] LoongArch: Preserve syscall nr across execve() Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231211135147.380223-34-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=hch@lst.de \
--cc=kanie@linux.alibaba.com \
--cc=kbusch@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=sagi@grimberg.me \
--cc=stable@vger.kernel.org \
--cc=yaoma@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox