* [PATCHv3 0/2] nvme: fix system fault observed while shutting down controller
@ 2024-11-03 18:14 Nilay Shroff
2024-11-03 18:14 ` [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
2024-11-03 18:14 ` [PATCHv3 2/2] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
0 siblings, 2 replies; 6+ messages in thread
From: Nilay Shroff @ 2024-11-03 18:14 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, ming.lei, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
Hi,
This patch series addresses the system fault observed while shutting down
a fabric controller. We already fixed it earlier[1]; however, it was later
realized that there is a better, more optimal way to address it[2].
The first patch in the series reverts the changes implemented in [3],
essentially making the keep-alive operation asynchronous again, as it
was earlier.
The second patch in the series fixes the kernel crash observed while
shutting down a fabric controller. The fault was caused by a keep-alive
request sneaking in while the fabric controller was being shut down.
We encountered the below intermittent kernel crash while running
blktests nvme/037:
dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP: c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
autoremove_wake_function+0x0/0xbc (unreliable)
__blk_mq_sched_dispatch_requests+0x114/0x24c
blk_mq_sched_dispatch_requests+0x44/0x84
blk_mq_run_hw_queue+0x140/0x220
nvme_keep_alive_work+0xc8/0x19c [nvme_core]
process_one_work+0x200/0x4e0
worker_thread+0x340/0x504
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
We realized that the above crash is a regression caused by the changes
implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
beginning of the controller shutdown code path, or at least before
destroying the admin queue and freeing the admin tagset, so that
keep-alive cannot sneak in or interfere with the shutdown operation.
However, commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()") removed the keep-alive stop operation from the
beginning of the controller shutdown code path, which created the
possibility of keep-alive sneaking in and interfering with the shutdown
operation, causing the observed kernel crash.
To fix the observed crash, we decided to move nvme_stop_keep_alive() from
nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures
that we don't make forward progress and delete the admin queue until the
keep-alive operation is finished (if it's in-flight) or cancelled. The
second patch in the series addresses the kernel crash.
[1]https://lore.kernel.org/all/ZxFSkNI2p65ucTB5@kbusch-mbp.dhcp.thefacebook.com/
[2]https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
[3]https://lore.kernel.org/all/20241016030339.54029-3-nilay@linux.ibm.com/
Changes from v2:
- Move nvme_stop_keep_alive() from nvme_uninit_ctrl() to
nvme_remove_admin_tag_set() instead of adding it to
nvme_stop_ctrl() which would help save one callsite of
nvme_stop_keep_alive() (Ming Lei)
- The third patch in the series isn't necessary if we avoid the full
revert and squash the series to just one fixing commit (Keith
Busch)
Changes from v1:
- Update the commit log of the third patch to make the intent of the
changes clear (Sagi Grimberg)
Nilay Shroff (2):
Revert "nvme: make keep-alive synchronous operation"
nvme-fabrics: fix kernel crash while shutting down controller
drivers/nvme/host/core.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation"
2024-11-03 18:14 [PATCHv3 0/2] nvme: fix system fault observed while shutting down controller Nilay Shroff
@ 2024-11-03 18:14 ` Nilay Shroff
2024-11-04 0:31 ` Ming Lei
2024-11-03 18:14 ` [PATCHv3 2/2] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
1 sibling, 1 reply; 6+ messages in thread
From: Nilay Shroff @ 2024-11-03 18:14 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, ming.lei, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
It was realized that the fix implemented to contain the race condition
between the keep-alive task and the fabric shutdown code path in commit
d06923670b5a ("nvme: make keep-alive synchronous operation") is not
optimal.
We also found that the above race condition is a regression caused by
the changes implemented in commit a54a93d0e359 ("nvme: move stopping
keep-alive into nvme_uninit_ctrl()"). So we decided to revert commit
d06923670b5a ("nvme: make keep-alive synchronous operation") and then
fix the regression.
Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b149b638453f..ddf07df243d3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1303,9 +1303,10 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
}
-static void nvme_keep_alive_finish(struct request *rq,
- blk_status_t status, struct nvme_ctrl *ctrl)
+static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
+ blk_status_t status)
{
+ struct nvme_ctrl *ctrl = rq->end_io_data;
unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
unsigned long delay = nvme_keep_alive_work_period(ctrl);
enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
@@ -1322,17 +1323,20 @@ static void nvme_keep_alive_finish(struct request *rq,
delay = 0;
}
+ blk_mq_free_request(rq);
+
if (status) {
dev_err(ctrl->device,
"failed nvme_keep_alive_end_io error=%d\n",
status);
- return;
+ return RQ_END_IO_NONE;
}
ctrl->ka_last_check_time = jiffies;
ctrl->comp_seen = false;
if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
+ return RQ_END_IO_NONE;
}
static void nvme_keep_alive_work(struct work_struct *work)
@@ -1341,7 +1345,6 @@ static void nvme_keep_alive_work(struct work_struct *work)
struct nvme_ctrl, ka_work);
bool comp_seen = ctrl->comp_seen;
struct request *rq;
- blk_status_t status;
ctrl->ka_last_check_time = jiffies;
@@ -1364,9 +1367,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
nvme_init_request(rq, &ctrl->ka_cmd);
rq->timeout = ctrl->kato * HZ;
- status = blk_execute_rq(rq, false);
- nvme_keep_alive_finish(rq, status, ctrl);
- blk_mq_free_request(rq);
+ rq->end_io = nvme_keep_alive_end_io;
+ rq->end_io_data = ctrl;
+ blk_execute_rq_nowait(rq, false);
}
static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
--
2.45.2
* [PATCHv3 2/2] nvme-fabrics: fix kernel crash while shutting down controller
2024-11-03 18:14 [PATCHv3 0/2] nvme: fix system fault observed while shutting down controller Nilay Shroff
2024-11-03 18:14 ` [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
@ 2024-11-03 18:14 ` Nilay Shroff
2024-11-04 0:22 ` Ming Lei
1 sibling, 1 reply; 6+ messages in thread
From: Nilay Shroff @ 2024-11-03 18:14 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, ming.lei, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:
Call Trace:
autoremove_wake_function+0x0/0xbc (unreliable)
__blk_mq_sched_dispatch_requests+0x114/0x24c
blk_mq_sched_dispatch_requests+0x44/0x84
blk_mq_run_hw_queue+0x140/0x220
nvme_keep_alive_work+0xc8/0x19c [nvme_core]
process_one_work+0x200/0x4e0
worker_thread+0x340/0x504
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
While shutting down the fabric controller, if an nvme keep-alive request
sneaks in, it would be flushed out. The nvme_keep_alive_end_io function
is then invoked to handle the end of the keep-alive operation; it
decrements the admin queue's q_usage_counter, and, assuming this was the
last/only request in the admin queue, the q_usage_counter becomes zero.
If that happens, the blk-mq destroy queue operation
(blk_mq_destroy_queue()), which could potentially be running
simultaneously on another cpu (as this is the controller shutdown code
path), would make forward progress and delete the admin queue. From that
point onward we are not supposed to access the admin queue resources.
However, the issue here is that the nvme keep-alive thread running the
hw/hctx queue dispatch operation hasn't yet finished its work, so it
could still access the admin queue resources after the admin queue has
already been deleted, which causes the above crash.
The above kernel crash is a regression caused by the changes implemented
in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive before
destroying the admin queue and freeing the admin tagset, so that it
wouldn't sneak in during the shutdown operation. However, commit
a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
removed the keep-alive stop operation from the beginning of the
controller shutdown code path and added it under nvme_uninit_ctrl(),
which executes very late in the shutdown code path, after the admin
queue is destroyed and its tagset is removed. This change created the
possibility of keep-alive sneaking in and interfering with the shutdown
operation, causing the observed kernel crash.
To fix the observed crash, we decided to move nvme_stop_keep_alive() from
nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures
that we don't make forward progress and delete the admin queue until the
keep-alive operation is finished (if it's in-flight) or cancelled, which
contains the race condition explained above and hence avoids the crash.
Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
Link: https://lore.kernel.org/all/1a21f37b-0f2a-4745-8c56-4dc8628d3983@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ddf07df243d3..a55c56123bfe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4593,6 +4593,11 @@ EXPORT_SYMBOL_GPL(nvme_alloc_admin_tag_set);
void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
{
+ /*
+ * As we're about to destroy the queue and free tagset
+ * we can not have keep-alive work running.
+ */
+ nvme_stop_keep_alive(ctrl);
blk_mq_destroy_queue(ctrl->admin_q);
blk_put_queue(ctrl->admin_q);
if (ctrl->ops->flags & NVME_F_FABRICS) {
--
2.45.2
* Re: [PATCHv3 2/2] nvme-fabrics: fix kernel crash while shutting down controller
2024-11-03 18:14 ` [PATCHv3 2/2] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
@ 2024-11-04 0:22 ` Ming Lei
0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2024-11-04 0:22 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Sun, Nov 03, 2024 at 11:44:42PM +0530, Nilay Shroff wrote:
> The nvme keep-alive operation, which executes at a periodic interval,
> could potentially sneak in while shutting down a fabric controller.
> This may lead to a race between the fabric controller admin queue
> destroy code path (invoked while shutting down controller) and hw/hctx
> queue dispatcher called from the nvme keep-alive async request queuing
> operation. This race could lead to the kernel crash shown below:
>
> Call Trace:
> autoremove_wake_function+0x0/0xbc (unreliable)
> __blk_mq_sched_dispatch_requests+0x114/0x24c
> blk_mq_sched_dispatch_requests+0x44/0x84
> blk_mq_run_hw_queue+0x140/0x220
> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
> process_one_work+0x200/0x4e0
> worker_thread+0x340/0x504
> kthread+0x138/0x140
> start_kernel_thread+0x14/0x18
>
> While shutting down the fabric controller, if an nvme keep-alive request
> sneaks in, it would be flushed out. The nvme_keep_alive_end_io function
> is then invoked to handle the end of the keep-alive operation; it
> decrements the admin queue's q_usage_counter, and, assuming this was the
> last/only request in the admin queue, the q_usage_counter becomes zero.
> If that happens, the blk-mq destroy queue operation
> (blk_mq_destroy_queue()), which could potentially be running
> simultaneously on another cpu (as this is the controller shutdown code
> path), would make forward progress and delete the admin queue. From that
> point onward we are not supposed to access the admin queue resources.
> However, the issue here is that the nvme keep-alive thread running the
> hw/hctx queue dispatch operation hasn't yet finished its work, so it
> could still access the admin queue resources after the admin queue has
> already been deleted, which causes the above crash.
>
> The above kernel crash is a regression caused by the changes implemented
> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> nvme_uninit_ctrl()"). Ideally we should stop keep-alive before
> destroying the admin queue and freeing the admin tagset, so that it
> wouldn't sneak in during the shutdown operation. However, commit
> a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> removed the keep-alive stop operation from the beginning of the
> controller shutdown code path and added it under nvme_uninit_ctrl(),
> which executes very late in the shutdown code path, after the admin
> queue is destroyed and its tagset is removed. This change created the
> possibility of keep-alive sneaking in and interfering with the shutdown
> operation, causing the observed kernel crash.
>
> To fix the observed crash, we decided to move nvme_stop_keep_alive() from
> nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures
> that we don't make forward progress and delete the admin queue until the
> keep-alive operation is finished (if it's in-flight) or cancelled, which
> contains the race condition explained above and hence avoids the crash.
>
> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> Link: https://lore.kernel.org/all/1a21f37b-0f2a-4745-8c56-4dc8628d3983@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index ddf07df243d3..a55c56123bfe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4593,6 +4593,11 @@ EXPORT_SYMBOL_GPL(nvme_alloc_admin_tag_set);
>
> void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> {
> + /*
> + * As we're about to destroy the queue and free tagset
> + * we can not have keep-alive work running.
> + */
> + nvme_stop_keep_alive(ctrl);
> blk_mq_destroy_queue(ctrl->admin_q);
> blk_put_queue(ctrl->admin_q);
> if (ctrl->ops->flags & NVME_F_FABRICS) {
Looks fine,
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
* Re: [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation"
2024-11-03 18:14 ` [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
@ 2024-11-04 0:31 ` Ming Lei
2024-11-05 6:07 ` Nilay Shroff
0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2024-11-04 0:31 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Sun, Nov 03, 2024 at 11:44:41PM +0530, Nilay Shroff wrote:
> This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
>
> It was realized that the fix implemented to contain the race condition
> between the keep-alive task and the fabric shutdown code path in commit
> d06923670b5a ("nvme: make keep-alive synchronous operation") is not
> optimal.
One good commit log is supposed to explain a bit why it isn't optimal
instead of "It was realized" without any details.
>
> We also found that the above race condition is a regression caused by
> the changes implemented in commit a54a93d0e359 ("nvme: move stopping
> keep-alive into nvme_uninit_ctrl()"). So we decided to revert commit
> d06923670b5a ("nvme: make keep-alive synchronous operation") and then
> fix the regression.
>
> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
The patch itself looks good:
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
* Re: [PATCHv3 1/2] Revert "nvme: make keep-alive synchronous operation"
2024-11-04 0:31 ` Ming Lei
@ 2024-11-05 6:07 ` Nilay Shroff
0 siblings, 0 replies; 6+ messages in thread
From: Nilay Shroff @ 2024-11-05 6:07 UTC (permalink / raw)
To: Ming Lei
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 11/4/24 06:01, Ming Lei wrote:
> On Sun, Nov 03, 2024 at 11:44:41PM +0530, Nilay Shroff wrote:
>> This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
>>
>> It was realized that the fix implemented to contain the race condition
>> between the keep-alive task and the fabric shutdown code path in commit
>> d06923670b5a ("nvme: make keep-alive synchronous operation") is not
>> optimal.
>
> One good commit log is supposed to explain a bit why it isn't optimal
> instead of "It was realized" without any details.
>
I thought the link included in the patch would help. But I think you're
correct; it would help if we included a brief description here in the
commit log.
No worries... I will add it and spin a new patch version.
>>
>> We also found that the above race condition is a regression caused by
>> the changes implemented in commit a54a93d0e359 ("nvme: move stopping
>> keep-alive into nvme_uninit_ctrl()"). So we decided to revert commit
>> d06923670b5a ("nvme: make keep-alive synchronous operation") and then
>> fix the regression.
>>
>> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>
> The patch itself looks good:
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>
>
Thanks,
--Nilay