[PATCHv2 0/3] nvme: fix system fault observed while shutting down controller

Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHv2 0/3] nvme: fix system fault observed while shutting down controller
@ 2024-10-28 12:47 Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Nilay Shroff @ 2024-10-28 12:47 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

Hi,

This patch series addresses the system fault observed while shutting
down fabric controller. We already fixed it[1] earlier however it was
later relaized that we do have a better and optimal way to address it
[2].

The first patch in the series reverts the changes implemented in [3] and
[4]. So essentially we're making keep-alive operation asynchronous again 
as it was earlier.
The second patch in the series fix the kernel crash observed while 
shutting down fabric controller.
The third patch in the series uses the nvme_ctrl_state function for 
retrieving the controller state.

The system fault was observed due to the keep-alive request sneaking in
while shutting down fabric controller. We encounter the below intermittent
kernel crash while running blktest nvme/037:

dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP:  c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

We realized  that the above crash is regression caused due to changes 
implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into 
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very 
beggining of the controller shutdown code path so that it wouldn't sneak 
in or interfere with the shutdown operation. However we removed the keep
alive stop operation from the beginning of the controller shutdown code 
path in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_
uninit_ctrl()") and that now created the possibility of keep-alive 
sneaking in and interfering with the shutdown operation and causing 
observed kernel crash. So to fix this crash, now we're adding back the
keep-alive stop operation at very beginning of the fabric controller
shutdown code path so that the actual controller shutdown opeation only
begins after it's ensured that keep-alive operation is not in-flight and
also it can't be scheduled in future. This fixed in the second patch of 
the series. 

The third patch in the series addresses the use of ctrl->lock before
accessing NVMe controller state in nvme_keep_alive_end_io function.
With introduction of helper nvme_ctrl_state, we no longer need to
first acquire ctrl->lock before accessing the NVMe controller state.
So this patch removes the use of ctrl->lock from nvme_keep_alive_end_io
function and replaces it with helper nvme_ctrl_state call.

[1]https://lore.kernel.org/all/ZxFSkNI2p65ucTB5@kbusch-mbp.dhcp.thefacebook.com/
[2]https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
[3]https://lore.kernel.org/all/20241016030339.54029-3-nilay@linux.ibm.com/
[4]https://lore.kernel.org/all/20241016030339.54029-4-nilay@linux.ibm.com/

Changes from v1:
    - Update the commit log of the third patch to make the intent of the
      changes clear (Sagi Grimberg)

Nilay Shroff (3):
  Revert "nvme: make keep-alive synchronous operation"
  nvme-fabrics: fix kernel crash while shutting down controller
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function

 drivers/nvme/host/core.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

-- 
2.45.2

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCHv2 1/3] Revert "nvme: make keep-alive synchronous operation"
  2024-10-28 12:47 [PATCHv2 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
@ 2024-10-28 12:47 ` Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
  2 siblings, 0 replies; 4+ messages in thread
From: Nilay Shroff @ 2024-10-28 12:47 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
This reverts commit 599d9f3a10eec69ef28a90161763e4bd7c9c02bf.

It was realized that the fix implemented to avoid the race
condition between keep alive task and the fabric shutdown code
path in the commit d06923670b5ia ("nvme: make keep-alive
synchronous operation") is not optimal.

We also found that the above race condition is regression caused
due to the changes implemented in commit a54a93d0e359 ("nvme: move
stopping keep-alive into nvme_uninit_ctrl()"). So we decided to
first revert the commit d06923670b5a ("nvme: make keep-alive
synchronous operation") and commit 599d9f3a10ee ("nvme: use helper
nvme_ctrl_state in nvme_keep_alive_finish function") and then fix
the regression.

Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d..5016f69e9a15 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1292,12 +1292,14 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
-static void nvme_keep_alive_finish(struct request *rq,
-		blk_status_t status, struct nvme_ctrl *ctrl)
+static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
+						 blk_status_t status)
 {
+	struct nvme_ctrl *ctrl = rq->end_io_data;
+	unsigned long flags;
+	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
 	unsigned long delay = nvme_keep_alive_work_period(ctrl);
-	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
 
 	/*
 	 * Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1311,17 +1313,25 @@ static void nvme_keep_alive_finish(struct request *rq,
 		delay = 0;
 	}
 
+	blk_mq_free_request(rq);
+
 	if (status) {
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 				status);
-		return;
+		return RQ_END_IO_NONE;
 	}
 
 	ctrl->ka_last_check_time = jiffies;
 	ctrl->comp_seen = false;
-	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
+	spin_lock_irqsave(&ctrl->lock, flags);
+	if (ctrl->state == NVME_CTRL_LIVE ||
+	    ctrl->state == NVME_CTRL_CONNECTING)
+		startka = true;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	if (startka)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
+	return RQ_END_IO_NONE;
 }
 
 static void nvme_keep_alive_work(struct work_struct *work)
@@ -1330,7 +1340,6 @@ static void nvme_keep_alive_work(struct work_struct *work)
 			struct nvme_ctrl, ka_work);
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
-	blk_status_t status;
 
 	ctrl->ka_last_check_time = jiffies;
 
@@ -1353,9 +1362,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	nvme_init_request(rq, &ctrl->ka_cmd);
 
 	rq->timeout = ctrl->kato * HZ;
-	status = blk_execute_rq(rq, false);
-	nvme_keep_alive_finish(rq, status, ctrl);
-	blk_mq_free_request(rq);
+	rq->end_io = nvme_keep_alive_end_io;
+	rq->end_io_data = ctrl;
+	blk_execute_rq_nowait(rq, false);
 }
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCHv2 2/3] nvme-fabrics: fix kernel crash while shutting down controller
  2024-10-28 12:47 [PATCHv2 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
@ 2024-10-28 12:47 ` Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
  2 siblings, 0 replies; 4+ messages in thread
From: Nilay Shroff @ 2024-10-28 12:47 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

While shutting down fabric controller, if nvme keep-alive request sneaks
in then it would be flushed off. The nvme_keep_alive_end_io function is
then invoked to handle the end of the keep-alive operation which
decrements the admin->q_usage_counter and assuming this is the last/only
request in the admin queue then the admin->q_usage_counter becomes zero.
If that happens then blk-mq destroy queue operation (blk_mq_destroy_
queue()) which could be potentially running simultaneously on another
cpu (as this is the controller shutdown code path) would forward
progress and deletes the admin queue. So, now from this point onward
we are not supposed to access the admin queue resources. However the
issue here's that the nvme keep-alive thread running hw/hctx queue
dispatch operation hasn't yet finished its work and so it could still
potentially access the admin queue resource while the admin queue had
been already deleted and that causes the above crash.

The above kernel crash is regression caused due to changes implemented
in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
beggining of the controller shutdown code path so that it wouldn't
sneak in during the shutdown operation. However we removed the keep
alive stop operation from the beginning of the controller shutdown
code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()") and that now created the possibility of keep-alive
sneaking in and interfering with the shutdown operation and causing
observed kernel crash. So to fix this crash, now we're adding back the
keep-alive stop operation at very beginning of the fabric controller
shutdown code path so that the actual controller shutdown opeation only
begins after it's ensured that keep-alive operation is not in-flight and
also it can't be scheduled in future.

Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5016f69e9a15..865c00ea19e3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_auth_stop(ctrl);
+	/*
+	 * the transport driver may be terminating the admin tagset a little
+	 * later on, so we cannot have the keep-alive work running
+	 */
+	nvme_stop_keep_alive(ctrl);
 	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCHv2 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
  2024-10-28 12:47 [PATCHv2 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
  2024-10-28 12:47 ` [PATCHv2 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
@ 2024-10-28 12:47 ` Nilay Shroff
  2 siblings, 0 replies; 4+ messages in thread
From: Nilay Shroff @ 2024-10-28 12:47 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

We use WRITE_ONCE() for updating the ctrl->state value. So it makes
sense to pair the read of ctrl->state using READ_ONCE(). The helper
nvme_ctrl_state(), helps reads the ctrl->state using READ_ONCE().

In nvme_keep_alive_end_io function, we first acquire ctrl->lock before
reading ctrl->state. However, with use of helper nvme_ctrl_state()
we can avoid ctrl->lock. So this commit removes the ctrl->lock and
replaces it with nvme_ctrl_state call.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 865c00ea19e3..802634022bfa 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 						 blk_status_t status)
 {
 	struct nvme_ctrl *ctrl = rq->end_io_data;
-	unsigned long flags;
-	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
 	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
 
 	/*
 	 * Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 
 	ctrl->ka_last_check_time = jiffies;
 	ctrl->comp_seen = false;
-	spin_lock_irqsave(&ctrl->lock, flags);
-	if (ctrl->state == NVME_CTRL_LIVE ||
-	    ctrl->state == NVME_CTRL_CONNECTING)
-		startka = true;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (startka)
+	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
+
 	return RQ_END_IO_NONE;
 }
 
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-10-28 12:57 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-28 12:47 [PATCHv2 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
2024-10-28 12:47 ` [PATCHv2 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
2024-10-28 12:47 ` [PATCHv2 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
2024-10-28 12:47 ` [PATCHv2 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox