linux-nvme.lists.infradead.org archive mirror
* [PATCH v4 0/3] nvme: system fault while shutting down fabric controller
@ 2024-10-16  3:03 Nilay Shroff
  2024-10-16  3:03 ` [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Nilay Shroff @ 2024-10-16  3:03 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

We observed a kernel task hang and a kernel crash while shutting down an
NVMe fabric controller. Both issues were observed while running blktests
nvme/037. The first two patches in this series address the issues
encountered while running this test. The third patch in the series uses the
helper nvme_ctrl_state for accessing the NVMe controller state.

We intermittently observe the below kernel task hang while running
blktests nvme/037. This test sets up an NVMeOF passthru controller using
the loop target, connects to it and then immediately terminates/cleans up
the connection.

dmesg output:
-------------
run blktests nvme/037 at 2024-10-04 00:46:02
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme1: D3 entry latency set to 10 seconds
nvme nvme1: creating 32 I/O queues.
nvme nvme1: new ctrl: "blktests-subsystem-1"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: resetting controller
INFO: task nvme:3082 blocked for more than 120 seconds.
      Not tainted 6.11.0+ #89
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D stack:0     pid:3082  tgid:3082  ppid:2983   flags:0x00042080
Call Trace:
    0xc000000070f5bf90 (unreliable)
    __switch_to+0x148/0x230
    __schedule+0x260/0x6dc
    schedule+0x40/0x100
    blk_mq_freeze_queue_wait+0xa4/0xec
    blk_mq_destroy_queue+0x68/0xac
    nvme_remove_admin_tag_set+0x2c/0xb8 [nvme_core]
    nvme_loop_destroy_admin_queue+0x68/0x88 [nvme_loop]
    nvme_do_delete_ctrl+0x1e0/0x268 [nvme_core]
    nvme_delete_ctrl_sync+0xd4/0x104 [nvme_core]
    nvme_sysfs_delete+0x78/0x90 [nvme_core]
    dev_attr_store+0x34/0x50
    sysfs_kf_write+0x64/0x78
    kernfs_fop_write_iter+0x1b0/0x290
    vfs_write+0x3bc/0x4f8
    ksys_write+0x84/0x140
    system_call_exception+0x124/0x320
    system_call_vectored_common+0x15c/0x2ec

As we can see from the above trace, the nvme task hangs indefinitely while
shutting down the loop controller. The task can't make forward progress
because it's waiting for outstanding requests which haven't yet completed.

The first patch in the series fixes the above hang by ensuring that, while
shutting down the nvme loop controller, we flush to completion any pending
I/O that might have been queued after the queue was quiesced. To that end,
the first patch adds the missing unquiesce of the admin and IO queues in
the nvme loop driver just before the respective queue is destroyed.
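
For illustration, the hang can be modelled in plain userspace code as
follows. This is a hypothetical sketch with made-up names, not the kernel
implementation: a request that was queued while the queue is quiesced keeps
the usage count elevated, so a destroy path that only waits for the count
to drain blocks forever unless the queue is unquiesced first.

/*
 * Userspace model of the hang (illustration only, made-up names).
 * A request queued while the queue is quiesced holds a usage
 * reference but is never dispatched, so destroy_queue(), which waits
 * for usage == 0, hangs.  Calling unquiesce() first lets the pending
 * request complete (here: fail fast), dropping the reference and
 * unblocking the destroy -- which is what patch 1/3 adds.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int usage;		/* models ->q_usage_counter */
static bool quiesced = true;	/* queue was quiesced for shutdown */

static void queue_request(void)
{
	pthread_mutex_lock(&lock);
	usage++;		/* request enters the queue */
	pthread_mutex_unlock(&lock);
}

static void *dispatcher(void *arg)
{
	pthread_mutex_lock(&lock);
	while (quiesced)	/* quiesced: nothing gets dispatched */
		pthread_cond_wait(&cond, &lock);
	usage--;		/* request completes (fails fast) */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void unquiesce(void)
{
	pthread_mutex_lock(&lock);
	quiesced = false;	/* the step the first patch adds */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}

static void destroy_queue(void)
{
	pthread_mutex_lock(&lock);
	while (usage)		/* freeze-and-wait, like blk_mq_freeze_queue_wait() */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	printf("queue destroyed\n");
}

int main(void)
{
	pthread_t t;

	queue_request();	/* request sneaks in after quiesce */
	pthread_create(&t, NULL, dispatcher, NULL);

	unquiesce();		/* comment this out to reproduce the hang */
	destroy_queue();

	pthread_join(t, NULL);
	return 0;
}

Commenting out the unquiesce() call reproduces the indefinite wait seen in
blk_mq_freeze_queue_wait() in the trace above (build with -pthread).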

The second patch in the series fixes another issue, this time with the nvme
keep-alive operation: the keep-alive operation could potentially sneak in
while the fabric controller is shutting down. We intermittently encounter
the below kernel crash while running blktests nvme/037:

dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP:  c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

The above crash occurred while shutting down fabric/loop controller.

While shutting down the fabric controller, if an nvme keep-alive request
sneaks in and is later flushed off, then nvme_keep_alive_end_io() is
asynchronously invoked to handle the end of the keep-alive operation.
nvme_keep_alive_end_io() decrements admin->q_usage_counter and, assuming
this was the last/only request in the admin queue, admin->q_usage_counter
becomes zero. If that happens, the blk-mq destroy queue operation
(blk_mq_destroy_queue()), which could potentially be running simultaneously
on another cpu (as this is the controller shutdown code path), makes
forward progress and deletes the admin queue. However, at the same time the
nvme keep-alive thread running on the other cpu hasn't yet returned from
its async blk-mq request operation (i.e. blk_execute_rq_nowait()), so it
could still access admin queue resources which have already been released
by the controller shutdown code path, and that causes the observed symptom.
For instance, find below the sequence of operations running simultaneously
on two cpus that causes this issue:

cpu0:
nvme_keep_alive_work()
  ->blk_execute_rq_no_wait()
    ->blk_mq_run_hw_queue()
      ->blk_mq_sched_dispatch_requests()
        ->__blk_mq_sched_dispatch_requests()
          ->blk_mq_dispatch_rq_list()
            ->nvme_loop_queue_rq()
              ->nvme_fail_nonready_command() -- here keep-alive req fails because admin queue is shutting down
                ->nvme_complete_rq()
                  ->nvme_end_req()
                    ->blk_mq_end_request()
                      ->__blk_mq_end_request()
                        ->nvme_keep_alive_end_io() -- here we decrement admin->q_usage_counter
cpu1:
nvme_loop_delete_ctrl_host()
  ->nvme_loop_shutdown_ctrl()
    ->nvme_loop_destroy_admin_queue()
      ->nvme_remove_admin_tag_set()
        ->blk_mq_destroy_queue() -- here we wait until admin->q_usage_counter reaches zero
        ->blk_put_queue() -- here we destroy the queue once admin->q_usage_counter becomes zero
                          -- From here on we are not supposed to access admin queue
                             resources; however, the nvme keep-alive thread on cpu0 running
                             the hw/hctx queue dispatch operation may still access the admin
                             queue pointer, causing the observed crash.

So, prima facie, from the above trace it appears that the nvme keep-alive
thread running on one cpu races with the controller shutdown operation
running on another cpu.

The second patch in the series addresses the above issue by making nvme
keep-alive a synchronous operation, so that we decrement
admin->q_usage_counter only after the keep-alive command finishes and
returns its status. This also ensures that blk_mq_destroy_queue() doesn't
return until the nvme keep-alive thread has finished its work, and so it's
safe to destroy the queue.
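
For illustration, the race and the effect of making the operation
synchronous can be modelled in plain userspace code as below. This is a
hypothetical sketch with made-up names, not the kernel code: in the async
variant the completion callback drops the last usage reference while the
dispatcher is still running, so the shutdown side may free the queue
underneath it; in the sync variant the reference is dropped only after the
dispatch has fully returned.

/*
 * Userspace model of the use-after-free race (illustration only,
 * made-up names).  dispatch_async() mimics the old keep-alive path:
 * an end_io-style callback drops the last usage reference, but the
 * dispatcher keeps touching the queue afterwards.  shutdown_side()
 * mimics blk_mq_destroy_queue()/blk_put_queue(): it waits for
 * usage == 0 and then frees the queue.  dispatch_sync() mimics the
 * fixed path: the reference is dropped only after the dispatch has
 * fully returned.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct queue {
	atomic_int usage;	/* models admin->q_usage_counter */
	int state;
};

static struct queue *q;

static void *dispatch_async(void *arg)
{
	atomic_fetch_sub(&q->usage, 1);		/* end_io drops the last ref */
	usleep(1000);				/* dispatcher still running... */
	printf("late access: %d\n", q->state);	/* use-after-free window */
	return NULL;
}

static void *dispatch_sync(void *arg)
{
	usleep(1000);				/* dispatch runs to completion */
	printf("access while alive: %d\n", q->state);
	atomic_fetch_sub(&q->usage, 1);		/* reference dropped last */
	return NULL;
}

static void *shutdown_side(void *arg)
{
	while (atomic_load(&q->usage))		/* freeze/drain wait */
		;
	free(q);				/* queue is gone */
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t1, t2;
	int use_sync = argc > 1;	/* any argument: run the fixed ordering */

	(void)argv;
	q = calloc(1, sizeof(*q));
	atomic_store(&q->usage, 1);	/* one in-flight keep-alive */

	pthread_create(&t1, NULL, use_sync ? dispatch_sync : dispatch_async, NULL);
	pthread_create(&t2, NULL, shutdown_side, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

Built with -fsanitize=address and run without arguments, the model flags a
heap-use-after-free; run with any argument (the synchronous ordering) it is
clean.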

Moreover, the keep-alive command is lightweight and low-frequency, so
handling it synchronously is reasonable from a performance perspective.
Since this command is infrequent compared to other NVMe operations (like
reads/writes), it does not introduce significant overhead when handled
synchronously.

The third patch in the series addresses the use of ctrl->lock around
accessing the NVMe controller state in the nvme_keep_alive_finish function.
With the introduction of the helper nvme_ctrl_state, we no longer need to
acquire ctrl->lock before accessing the NVMe controller state. So this
patch removes the use of ctrl->lock from nvme_keep_alive_finish and
replaces it with a call to the helper nvme_ctrl_state.
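
For illustration, the pattern that change follows is roughly the one below.
This is a minimal userspace sketch with made-up names, assuming the helper
boils down to a single atomic snapshot of the controller state: instead of
taking a lock just to read one enum consistently, read it with one atomic
load.

/*
 * Userspace sketch of the pattern (made-up names, illustration only).
 * Before: a lock is taken only to read a single enum.  After: one
 * atomic load, which is enough because the caller only needs a
 * consistent snapshot of the value, not a value that stays stable for
 * the rest of the function.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { CTRL_LIVE, CTRL_CONNECTING, CTRL_DELETING };

struct ctrl {
	pthread_mutex_t lock;
	_Atomic enum ctrl_state state;
};

/* before: lock taken around the read of ->state */
static bool should_requeue_locked(struct ctrl *ctrl)
{
	bool startka;

	pthread_mutex_lock(&ctrl->lock);
	startka = ctrl->state == CTRL_LIVE || ctrl->state == CTRL_CONNECTING;
	pthread_mutex_unlock(&ctrl->lock);
	return startka;
}

/* after: one atomic snapshot, no lock */
static bool should_requeue(struct ctrl *ctrl)
{
	enum ctrl_state state = atomic_load(&ctrl->state);

	return state == CTRL_LIVE || state == CTRL_CONNECTING;
}

int main(void)
{
	struct ctrl c = { .lock = PTHREAD_MUTEX_INITIALIZER, .state = CTRL_LIVE };

	printf("locked: %d, lockless: %d\n",
	       should_requeue_locked(&c), should_requeue(&c));
	return 0;
}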

Changelog:

Changes since v3:
    - Spell out "queue-usage-ref-counter" as "->q_usage_counter"
      (Christoph Hellwig)
    - Improve the commit log of second patch in the series
      (Christoph Hellwig)
    - A few minor nitpick fixes (Christoph Hellwig)

Changes since v2:
    - In the third patch in this series, remove the redundant
      "startka" variable from the nvme_keep_alive_finish function
      and unify the logic to queue delayed work (Damien Le Moal)

Changes since v1:
    - Split the second patch and move the use of helper
      nvme_ctrl_state call in third patch (Christoph Hellwig)

Nilay Shroff (3):
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme: make keep-alive synchronous operation
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function

 drivers/nvme/host/core.c   | 27 +++++++++------------------
 drivers/nvme/target/loop.c | 13 +++++++++++++
 2 files changed, 22 insertions(+), 18 deletions(-)

-- 
2.45.2



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller
  2024-10-16  3:03 [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
@ 2024-10-16  3:03 ` Nilay Shroff
  2024-10-20 21:37   ` Sagi Grimberg
  2024-10-16  3:03 ` [PATCH v4 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2024-10-16  3:03 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

While shutting down the loop controller, we first quiesce the admin/IO
queue, delete the admin/IO tag-set and then finally destroy the admin/IO
queue. However, it's quite possible that during the window between
quiescing and destroying the admin/IO queue, some admin/IO request might
sneak in, and if that happens we could potentially encounter a hung task
because the shutdown operation can't make forward progress until any
pending I/O is flushed off.

This commit ensures that before destroying the admin/IO queue, we
unquiesce it so that any outstanding requests, which were added after the
admin/IO queue was quiesced, are flushed to completion.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/target/loop.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e32790d8fc26..a9d112d34d4f 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -265,6 +265,13 @@ static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
 {
 	if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
 		return;
+	/*
+	 * It's possible that some requests might have been added
+	 * after admin queue is stopped/quiesced. So now start the
+	 * queue to flush these requests to the completion.
+	 */
+	nvme_unquiesce_admin_queue(&ctrl->ctrl);
+
 	nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
 	nvme_remove_admin_tag_set(&ctrl->ctrl);
 }
@@ -297,6 +304,12 @@ static void nvme_loop_destroy_io_queues(struct nvme_loop_ctrl *ctrl)
 		nvmet_sq_destroy(&ctrl->queues[i].nvme_sq);
 	}
 	ctrl->ctrl.queue_count = 1;
+	/*
+	 * It's possible that some requests might have been added
+	 * after io queue is stopped/quiesced. So now start the
+	 * queue to flush these requests to the completion.
+	 */
+	nvme_unquiesce_io_queues(&ctrl->ctrl);
 }
 
 static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-16  3:03 [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
  2024-10-16  3:03 ` [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
@ 2024-10-16  3:03 ` Nilay Shroff
  2024-10-16  7:59   ` Christoph Hellwig
  2024-10-20 21:49   ` Sagi Grimberg
  2024-10-16  3:03 ` [PATCH v4 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
  2024-10-17 18:08 ` [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Keith Busch
  3 siblings, 2 replies; 16+ messages in thread
From: Nilay Shroff @ 2024-10-16  3:03 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

While shutting down the fabric controller, if an nvme keep-alive request
sneaks in then it will be flushed off. The nvme_keep_alive_end_io function
is then invoked to handle the end of the keep-alive operation; it
decrements admin->q_usage_counter and, assuming this is the last/only
request in the admin queue, admin->q_usage_counter becomes zero. If that
happens, the blk-mq destroy queue operation (blk_mq_destroy_queue()), which
could potentially be running simultaneously on another cpu (as this is the
controller shutdown code path), makes forward progress and deletes the
admin queue. So, from this point onward we are not supposed to access the
admin queue resources. However, the issue here is that the nvme keep-alive
thread running the hw/hctx queue dispatch operation hasn't yet finished its
work, and so it could still potentially access the admin queue resources
after the admin queue has already been deleted, which causes the above
crash.

This fix avoids the observed crash by implementing keep-alive as a
synchronous operation, so that we decrement admin->q_usage_counter only
after the keep-alive command has finished its execution and returned its
status back to its caller (blk_execute_rq()). This ensures that the fabric
shutdown code path doesn't destroy the fabric admin queue until the
keep-alive request has finished execution, and also that the keep-alive
thread is no longer running the hw/hctx queue dispatch operation.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 02897f0564a3..b7541dda36b4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1292,10 +1292,9 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
-static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-						 blk_status_t status)
+static void nvme_keep_alive_finish(struct request *rq,
+		blk_status_t status, struct nvme_ctrl *ctrl)
 {
-	struct nvme_ctrl *ctrl = rq->end_io_data;
 	unsigned long flags;
 	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
@@ -1313,13 +1312,11 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		delay = 0;
 	}
 
-	blk_mq_free_request(rq);
-
 	if (status) {
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 				status);
-		return RQ_END_IO_NONE;
+		return;
 	}
 
 	ctrl->ka_last_check_time = jiffies;
@@ -1331,7 +1328,6 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
-	return RQ_END_IO_NONE;
 }
 
 static void nvme_keep_alive_work(struct work_struct *work)
@@ -1340,6 +1336,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
 			struct nvme_ctrl, ka_work);
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
+	blk_status_t status;
 
 	ctrl->ka_last_check_time = jiffies;
 
@@ -1362,9 +1359,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	nvme_init_request(rq, &ctrl->ka_cmd);
 
 	rq->timeout = ctrl->kato * HZ;
-	rq->end_io = nvme_keep_alive_end_io;
-	rq->end_io_data = ctrl;
-	blk_execute_rq_nowait(rq, false);
+	status = blk_execute_rq(rq, false);
+	nvme_keep_alive_finish(rq, status, ctrl);
+	blk_mq_free_request(rq);
 }
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
  2024-10-16  3:03 [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
  2024-10-16  3:03 ` [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
  2024-10-16  3:03 ` [PATCH v4 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
@ 2024-10-16  3:03 ` Nilay Shroff
  2024-10-17 18:08 ` [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Keith Busch
  3 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2024-10-16  3:03 UTC (permalink / raw)
  To: linux-nvme
  Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
	Nilay Shroff

We no longer need to acquire ctrl->lock before accessing the
NVMe controller state; instead we can now use the helper
nvme_ctrl_state. So replace the use of ctrl->lock in the
nvme_keep_alive_finish function with a call to nvme_ctrl_state.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b7541dda36b4..5a6a6099cbd1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1295,10 +1295,9 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 static void nvme_keep_alive_finish(struct request *rq,
 		blk_status_t status, struct nvme_ctrl *ctrl)
 {
-	unsigned long flags;
-	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
 	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
 
 	/*
 	 * Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1321,12 +1320,7 @@ static void nvme_keep_alive_finish(struct request *rq,
 
 	ctrl->ka_last_check_time = jiffies;
 	ctrl->comp_seen = false;
-	spin_lock_irqsave(&ctrl->lock, flags);
-	if (ctrl->state == NVME_CTRL_LIVE ||
-	    ctrl->state == NVME_CTRL_CONNECTING)
-		startka = true;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (startka)
+	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-16  3:03 ` [PATCH v4 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
@ 2024-10-16  7:59   ` Christoph Hellwig
  2024-10-20 21:49   ` Sagi Grimberg
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2024-10-16  7:59 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] nvme: system fault while shutting down fabric controller
  2024-10-16  3:03 [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
                   ` (2 preceding siblings ...)
  2024-10-16  3:03 ` [PATCH v4 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
@ 2024-10-17 18:08 ` Keith Busch
  2024-10-20 21:51   ` Sagi Grimberg
  3 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2024-10-17 18:08 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce

On Wed, Oct 16, 2024 at 08:33:13AM +0530, Nilay Shroff wrote:
> We observed a kernel task hang up and a kernel crash while shutting down
> NVMe fabric controller. These issues were observed while running blktest
> nvme/037. The first two patches in this series address issues encountered
> while running this test. The third patch in the series is an attempt to
> use helper nvme_ctrl_state for accessing NVMe controller state.
> 
> We intermittently observe the below kernel task hang while running
> blktests nvme/037. This test sets up an NVMeOF passthru controller using
> the loop target, connects to it and then immediately terminates/cleans up
> the connection.

Thanks, applied to nvme-6.12.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller
  2024-10-16  3:03 ` [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
@ 2024-10-20 21:37   ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-20 21:37 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-16  3:03 ` [PATCH v4 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
  2024-10-16  7:59   ` Christoph Hellwig
@ 2024-10-20 21:49   ` Sagi Grimberg
  2024-10-21 11:58     ` Nilay Shroff
  1 sibling, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-20 21:49 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce

Hey Nilay,

First, can you please phrase the patch title to describe what it
is addressing, not what it is changing, that can come (as it does)
in the body.

Second, I don't think we want to spend a thread waiting for
synchronous IO for the purpose of keep alive. Is this the only
way that this can be done?

Third, was this observed with non-loop transports?


On 16/10/2024 6:03, Nilay Shroff wrote:
> The nvme keep-alive operation, which executes at a periodic interval,
> could potentially sneak in while shutting down a fabric controller.
> This may lead to a race between the fabric controller admin queue
> destroy code path (invoked while shutting down controller) and hw/hctx
> queue dispatcher called from the nvme keep-alive async request queuing
> operation. This race could lead to the kernel crash shown below:

What is different about keep-alive than any other command?

>
> Call Trace:
>      autoremove_wake_function+0x0/0xbc (unreliable)
>      __blk_mq_sched_dispatch_requests+0x114/0x24c
>      blk_mq_sched_dispatch_requests+0x44/0x84
>      blk_mq_run_hw_queue+0x140/0x220
>      nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>      process_one_work+0x200/0x4e0
>      worker_thread+0x340/0x504
>      kthread+0x138/0x140
>      start_kernel_thread+0x14/0x18

What is the "crash" here? use-after-free?

>
> While shutting down fabric controller, if nvme keep-alive request sneaks
> in then it would be flushed off. The nvme_keep_alive_end_io function is
> then invoked to handle the end of the keep-alive operation which
> decrements the admin->q_usage_counter and assuming this is the last/only
> request in the admin queue then the admin->q_usage_counter becomes zero.
> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
> queue()) which could be potentially running simultaneously on another
> cpu (as this is the controller shutdown code path) would forward
> progress and deletes the admin queue. So, now from this point onward
> we are not supposed to access the admin queue resources. However the
> issue here's that the nvme keep-alive thread running hw/hctx queue
> dispatch operation hasn't yet finished its work and so it could still
> potentially access the admin queue resource while the admin queue had
> been already deleted and that causes the above crash.

Again, it is unclear to me what justifies the cure here. Every command can
race with the shutdown, and it can be sync or async.

>
> This fix helps avoid the observed crash by implementing keep-alive as a
> synchronous operation so that we decrement admin->q_usage_counter only
> after keep-alive command finished its execution and returns the command
> status back up to its caller (blk_execute_rq()). This would ensure that
> fabric shutdown code path doesn't destroy the fabric admin queue until
> keep-alive request finished execution and also keep-alive thread is not
> running hw/hctx queue dispatch operation.

I'd want to see exactly what is causing this race because we already flush
the keep alive work when starting the shutdown... so it only means it is
incorrectly firing again?

What I think we should do instead, given that we already cancel_sync the
keep alive work before making progress with the teardown, is just make
sure that it doesn't get triggered again afterwards.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] nvme: system fault while shutting down fabric controller
  2024-10-17 18:08 ` [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Keith Busch
@ 2024-10-20 21:51   ` Sagi Grimberg
  2024-10-21 12:03     ` Nilay Shroff
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-20 21:51 UTC (permalink / raw)
  To: Keith Busch, Nilay Shroff
  Cc: linux-nvme, hch, axboe, chaitanyak, dlemoal, gjoyce




On 17/10/2024 21:08, Keith Busch wrote:
> On Wed, Oct 16, 2024 at 08:33:13AM +0530, Nilay Shroff wrote:
>> We observed a kernel task hang up and a kernel crash while shutting down
>> NVMe fabric controller. These issues were observed while running blktest
>> nvme/037. The first two patches in this series address issues encountered
>> while running this test. The third patch in the series is an attempt to
>> use helper nvme_ctrl_state for accessing NVMe controller state.
>>
>> We intermittently observe the below kernel task hang while running
>> blktests nvme/037. This test sets up an NVMeOF passthru controller using
>> the loop target, connects to it and then immediately terminates/cleans up
>> the connection.
> Thanks, applied to nvme-6.12.

ugh.. looks like I'm late (after a too long of a delay... I know)

I think we want to sort this out, it is not clear to me that the approach
taken is the correct one.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-20 21:49   ` Sagi Grimberg
@ 2024-10-21 11:58     ` Nilay Shroff
  2024-10-21 15:14       ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2024-10-21 11:58 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce



On 10/21/24 03:19, Sagi Grimberg wrote:
> Hey Nilay,
> 
> First, can you please phrase the patch title to describe what it
> is addressing, not what it is changing, that can come (as it does)
> in the body.
> 
sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?

> Second, I don't think we want to spend a thread waiting for
> synchronous IO for the purpose of keep alive. Is this the only
> way that this can be done?
> 

> Third, was this observed with non-loop transports?
> 
Yes it could happen even with non-loop fabric transports.

> 
> On 16/10/2024 6:03, Nilay Shroff wrote:
>> The nvme keep-alive operation, which executes at a periodic interval,
>> could potentially sneak in while shutting down a fabric controller.
>> This may lead to a race between the fabric controller admin queue
>> destroy code path (invoked while shutting down controller) and hw/hctx
>> queue dispatcher called from the nvme keep-alive async request queuing
>> operation. This race could lead to the kernel crash shown below:
> 
> What is different about keep-alive than any other command?
The one difference is, of course, that keep-alive is async, but we may have
other async requests running while shutting down the controller. I found
that, compared to other commands, keep-alive runs in a worker thread
context, whereas other commands (for instance async read/write) run in user
context and use blk plugging. When we use plugging, the call path for
issuing a request to the driver is quite different from that of issuing a
request using blk_execute_rq_nowait(). This subtle difference is probably
the reason we don't see the kernel crash when issuing read/write commands
while shutting down the fabric controller.

The call path for blk plugging for issuing the request to the driver:
__blk_flush_plug()
  -> blk_mq_flush_plug_list()
    -> blk_mq_dispatch_plug_list()

blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
dispatching requests. Essentially, the queue dispatcher (the
blk_mq_run_hw_queue function) is protected from the queue being destroyed
on another cpu. That means that even after a request (assuming it's the
only request in the queue) is cancelled or flushed off because the fabric
controller is shutting down, the blk-mq destroy code path cannot make
forward progress until ->q_usage_counter is decremented and becomes zero.
And we can see in blk_mq_dispatch_plug_list() that it decrements
->q_usage_counter only after the blk_mq_run_hw_queue function returns.
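
To illustrate the guard the plugging path provides, here is a minimal
userspace sketch (made-up names, not the actual block layer code): the
dispatcher holds its own usage reference around the whole hw-queue run, so
a destroy path waiting for the counter to drain cannot complete until the
run has fully returned, even if the request itself completes (and drops its
own reference) in the middle of the run.

/*
 * Userspace sketch of the guard (made-up names, illustration only).
 * The invariant is that usage never drops to zero while
 * run_hw_queue() is still executing, because the caller holds an
 * extra reference across the whole run.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int usage;		/* models ->q_usage_counter */

static void complete_request(void)
{
	atomic_fetch_sub(&usage, 1);	/* the request's own reference */
}

static void run_hw_queue(void)
{
	complete_request();		/* request fails fast / completes */
	/* ... dispatcher keeps touching queue/hctx state here ... */
}

static void dispatch_plugged(void)
{
	atomic_fetch_add(&usage, 1);	/* extra ref taken by the plug path */
	run_hw_queue();
	atomic_fetch_sub(&usage, 1);	/* dropped only after the run returns */
}

int main(void)
{
	atomic_store(&usage, 1);	/* one queued request */
	dispatch_plugged();
	printf("usage after dispatch: %d\n", atomic_load(&usage));
	return 0;
}

Per the call paths above, blk_execute_rq_nowait() takes no such outer
reference: once the request's own reference is dropped from the completion
path, nothing keeps the queue alive for the remainder of the dispatch.
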
> 
>>
>> Call Trace:
>>      autoremove_wake_function+0x0/0xbc (unreliable)
>>      __blk_mq_sched_dispatch_requests+0x114/0x24c
>>      blk_mq_sched_dispatch_requests+0x44/0x84
>>      blk_mq_run_hw_queue+0x140/0x220
>>      nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>      process_one_work+0x200/0x4e0
>>      worker_thread+0x340/0x504
>>      kthread+0x138/0x140
>>      start_kernel_thread+0x14/0x18
> 
> What is the "crash" here? use-after-free?
Yes, it's a use-after-free kernel crash.
> 
>>
>> While shutting down fabric controller, if nvme keep-alive request sneaks
>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>> then invoked to handle the end of the keep-alive operation which
>> decrements the admin->q_usage_counter and assuming this is the last/only
>> request in the admin queue then the admin->q_usage_counter becomes zero.
>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>> queue()) which could be potentially running simultaneously on another
>> cpu (as this is the controller shutdown code path) would forward
>> progress and deletes the admin queue. So, now from this point onward
>> we are not supposed to access the admin queue resources. However the
>> issue here's that the nvme keep-alive thread running hw/hctx queue
>> dispatch operation hasn't yet finished its work and so it could still
>> potentially access the admin queue resource while the admin queue had
>> been already deleted and that causes the above crash.
> 
> Again, it is unclear to me what justifies the cure here. Every command can
> race with the shutdown, and it can be sync or async.
> 
While researching this issue, I found that the nvme keep-alive is probably
the only async request running in a worker thread context that is not well
synchronized with the fabric controller shutdown code path. Yes, we may have
other async requests (for instance async read/write) which might get in the
way of the fabric controller shutdown code path, but those mostly run in
user context and hence have a different code path than the keep-alive
command for submitting/queuing the request.

>>
>> This fix helps avoid the observed crash by implementing keep-alive as a
>> synchronous operation so that we decrement admin->q_usage_counter only
>> after keep-alive command finished its execution and returns the command
>> status back up to its caller (blk_execute_rq()). This would ensure that
>> fabric shutdown code path doesn't destroy the fabric admin queue until
>> keep-alive request finished execution and also keep-alive thread is not
>> running hw/hctx queue dispatch operation.
> 
> I'd want to see exactly what is causing this race because we already flush
> the keep alive work when starting the shutdown... so it only means it is
> incorrectly firing again?
> 
> What I think we should do instead, is given that we already cancel_sync the keep
> alive work before making progress with the teardown, we should just make sure
> that it doesn't triggered afterwards again if it does.

No, keep-alive isn't incorrectly fired. However, if, after a keep-alive
request is allocated but before it's queued, the shutdown code path gets in
its way and marks the queue as dying (i.e. setting QUEUE_FLAG_DYING on the
request queue as well as clearing NVME_TCP_Q_LIVE in case of the tcp
transport or clearing NVME_LOOP_Q_LIVE in case of the loop transport), then
this issue manifests. Let me now briefly show the place in the queue
dispatcher code where we see this crash.

block layer code path (cpu 0):
blk_execute_rq_nowait()
  -> blk_mq_run_hw_queue()
    -> blk_mq_sched_dispatch_requests()
      -> __blk_mq_sched_dispatch_requests()
        -> blk_mq_do_dispatch_ctx()
          -> blk_mq_dispatch_rq_list()
            -> q->mq_ops->queue_rq()

As we see in the above trace, the function which is called to submit a
request to the driver is ->queue_rq(). Assuming an async keep-alive request
is queued just after the request queue has been marked as dying, the driver
would not process this request but instead fail it:

loop driver call path for failed request(cpu 0):
nvme_loop_queue_rq()
  -> nvme_fail_nonready_command()
    -> nvme_host_path_error()
      -> nvme_complete_rq() 
        -> nvme_end_req()
          -> blk_mq_end_request()
            -> __blk_mq_end_request()
              -> nvme_keep_alive_end_io() 
                -> blk_mq_free_request() <<< here we decrement the ->q_usage_counter

Assuming keep-alive is the only command in the queue, nvme_keep_alive_end_io()
decrements ->q_usage_counter in the above code path and ->q_usage_counter now
reaches zero. At this point the shutdown code path, which could potentially be
running on another cpu and is waiting for ->q_usage_counter to become zero,
makes forward progress and deletes the queue. Please see below:

fabric controller shutdown code path (cpu 1):
nvme_remove_admin_tag_set()
  -> blk_mq_destroy_queue() 
    -> blk_mq_freeze_queue_wait() <<< here we wait for ->q_usage_counter to become zero

Once ->q_usage_counter becomes zero, the above code makes forward progress.
The blk_mq_destroy_queue function then invokes blk_put_queue(), which
deletes the admin queue.

Referring back to the loop driver call path running on cpu 0: it now returns
to the block layer. The block layer hasn't yet finished the dispatcher work
and returns to blk_mq_dispatch_rq_list(). blk_mq_dispatch_rq_list(), as well
as its caller blk_mq_do_dispatch_ctx(), still running dispatcher code, could
still access the @hctx; however this @hctx has already been deleted on
another cpu (under the controller shutdown code path, as described above),
and that causes the kernel crash.

I hope the above description helps with understanding the root cause. But
please let me know if you still have any further questions.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] nvme: system fault while shutting down fabric controller
  2024-10-20 21:51   ` Sagi Grimberg
@ 2024-10-21 12:03     ` Nilay Shroff
  0 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2024-10-21 12:03 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: linux-nvme, hch, axboe, chaitanyak, dlemoal, gjoyce



On 10/21/24 03:21, Sagi Grimberg wrote:
> 
> 
> 
> On 17/10/2024 21:08, Keith Busch wrote:
>> On Wed, Oct 16, 2024 at 08:33:13AM +0530, Nilay Shroff wrote:
>>> We observed a kernel task hang up and a kernel crash while shutting down
>>> NVMe fabric controller. These issues were observed while running blktest
>>> nvme/037. The first two patches in this series address issues encountered
>>> while running this test. The third patch in the series is an attempt to
>>> use helper nvme_ctrl_state for accessing NVMe controller state.
>>>
>>> We intermittently observe the below kernel task hang while running
>>> blktests nvme/037. This test sets up an NVMeOF passthru controller using
>>> the loop target, connects to it and then immediately terminates/cleans up
>>> the connection.
>> Thanks, applied to nvme-6.12.
> 
> ugh.. looks like I'm late (after a too long of a delay... I know)
> 
> I think we want to sort this out, it is not clear to me that the approach
> taken is the correct one.
I welcome your comments and am also willing to fix this if you think the
approach is not optimal...

Thanks,
--Nilay



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-21 11:58     ` Nilay Shroff
@ 2024-10-21 15:14       ` Sagi Grimberg
  2024-10-22  6:54         ` Nilay Shroff
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-21 15:14 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce




On 21/10/2024 14:58, Nilay Shroff wrote:
>
> On 10/21/24 03:19, Sagi Grimberg wrote:
>> Hey Nilay,
>>
>> First, can you please phrase the patch title to describe what it
>> is addressing, not what it is changing, that can come (as it does)
>> in the body.
>>
> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>
>> Second, I don't think we want to spend a thread waiting for
>> synchronous IO for the purpose of keep alive. Is this the only
>> way that this can be done?
>>
>> Third, was this observed with non-loop transports?
>>
> Yes it could happen even with non-loop fabric transports.
>
>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>> The nvme keep-alive operation, which executes at a periodic interval,
>>> could potentially sneak in while shutting down a fabric controller.
>>> This may lead to a race between the fabric controller admin queue
>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>> queue dispatcher called from the nvme keep-alive async request queuing
>>> operation. This race could lead to the kernel crash shown below:
>> What is different about keep-alive than any other command?
> The one difference is, of course, keep-alive is async but we may have
> other async requests running while shutting down the controller. I found
> that, when compared the keep-alive with other commands, the keep-alive
> runs in the worker thread context but other commands (for instance async
> read/write) runs in the user context and that uses the blk plugging. When we
> use plugging, the call path for issuing request to driver is quite different
> from that of issuing request using blk_execute_rq_nowait(). This subtle
> difference is probably the reason not causing the kernel crash issuing
> read/write command while shutting down the fabric controller.
>
> The call path for blk plugging for issuing the request to driver:
> __blk_flush_plug()
>    -> blk_mq_flush_plug_list()
>      ->blk_mq_dispatch_plug_list()
>
> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
> function) is protected from the queue being destroyed on other cpu thread.
> So that means that even after a request (assuming the only request in queue)
> is cancelled or flushed off due to fabric controller is shutting down, the
> blk-mq destroy code path would not be able to forward progress until the
> ->q_usage_counter is decremented and becomes zero. And we can see in
> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
> the blk_mq_run_hw_queue function returns.

Sounds like an issue that is not related to keep-alive at all.

>>> Call Trace:
>>>       autoremove_wake_function+0x0/0xbc (unreliable)
>>>       __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>       blk_mq_sched_dispatch_requests+0x44/0x84
>>>       blk_mq_run_hw_queue+0x140/0x220
>>>       nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>       process_one_work+0x200/0x4e0
>>>       worker_thread+0x340/0x504
>>>       kthread+0x138/0x140
>>>       start_kernel_thread+0x14/0x18
>> What is the "crash" here? use-after-free?
> yes it's use-after-free kernel crash.
>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>> then invoked to handle the end of the keep-alive operation which
>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>> queue()) which could be potentially running simultaneously on another
>>> cpu (as this is the controller shutdown code path) would forward
>>> progress and deletes the admin queue. So, now from this point onward
>>> we are not supposed to access the admin queue resources. However the
>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>> dispatch operation hasn't yet finished its work and so it could still
>>> potentially access the admin queue resource while the admin queue had
>>> been already deleted and that causes the above crash.
>> Again, it is unclear to me what justifies the cure here. Every command can
>> race with the shutdown, and it can be sync or async.
>>
> While I was researching on this issue, I figured that the nvme keep-alive is
> probably the only async request running in worker thread context and not well
> synchronized with the fabric controller shutdown code path. Yes we may have other
> async request (for instance async read/write) which might get-in the way of the
> fabric controller shutdown code path but those mostly run in the user context and
> hence have a different code path than keep-alive command for submitting/queuing
> the request.

This sounds like we papered around the problem. Is it not allowed to call
blk_execute_rq_nowait from a kthread?

>
>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>> synchronous operation so that we decrement admin->q_usage_counter only
>>> after keep-alive command finished its execution and returns the command
>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>> keep-alive request finished execution and also keep-alive thread is not
>>> running hw/hctx queue dispatch operation.
>> I'd want to see exactly what is causing this race because we already flush
>> the keep alive work when starting the shutdown... so it only means it is
>> incorrectly firing again?
>>
>> What I think we should do instead, is given that we already cancel_sync the keep
>> alive work before making progress with the teardown, we should just make sure
>> that it doesn't triggered afterwards again if it does.
> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
> allocated but before it's queued,

The queue teardown quiesces the admin queue first, which should prevent the
dispatch. You probably mean that it runs a little later, after the dispatch
passes the quiesced test...

The question is, why did blk_mq_quiesce_queue not wait for the rcu grace
period, which should have ensured that there is no dispatch going on? At
least that is what it's supposed to do... something doesn't add up.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-21 15:14       ` Sagi Grimberg
@ 2024-10-22  6:54         ` Nilay Shroff
  2024-10-22  9:43           ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2024-10-22  6:54 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce



On 10/21/24 20:44, Sagi Grimberg wrote:
> 
> 
> 
> On 21/10/2024 14:58, Nilay Shroff wrote:
>>
>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>> Hey Nilay,
>>>
>>> First, can you please phrase the patch title to describe what it
>>> is addressing, not what it is changing, that can come (as it does)
>>> in the body.
>>>
>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>
>>> Second, I don't think we want to spend a thread waiting for
>>> synchronous IO for the purpose of keep alive. Is this the only
>>> way that this can be done?
>>>
>>> Third, was this observed with non-loop transports?
>>>
>> Yes it could happen even with non-loop fabric transports.
>>
>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>> could potentially sneak in while shutting down a fabric controller.
>>>> This may lead to a race between the fabric controller admin queue
>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>> operation. This race could lead to the kernel crash shown below:
>>> What is different about keep-alive than any other command?
>> The one difference is, of course, keep-alive is async but we may have
>> other async requests running while shutting down the controller. I found
>> that, when compared the keep-alive with other commands, the keep-alive
>> runs in the worker thread context but other commands (for instance async
>> read/write) runs in the user context and that uses the blk plugging. When we
>> use plugging, the call path for issuing request to driver is quite different
>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>> difference is probably the reason not causing the kernel crash issuing
>> read/write command while shutting down the fabric controller.
>>
>> The call path for blk plugging for issuing the request to driver:
>> __blk_flush_plug()
>>    -> blk_mq_flush_plug_list()
>>      ->blk_mq_dispatch_plug_list()
>>
>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>> function) is protected from the queue being destroyed on other cpu thread.
>> So that means that even after a request (assuming the only request in queue)
>> is cancelled or flushed off due to fabric controller is shutting down, the
>> blk-mq destroy code path would not be able to forward progress until the
>> ->q_usage_counter is decremented and becomes zero. And we can see in
>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>> the blk_mq_run_hw_queue function returns.
> 
> Sounds like an issue that is not related to keep-alive at all.
> 
>>>> Call Trace:
>>>>       autoremove_wake_function+0x0/0xbc (unreliable)
>>>>       __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>       blk_mq_sched_dispatch_requests+0x44/0x84
>>>>       blk_mq_run_hw_queue+0x140/0x220
>>>>       nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>       process_one_work+0x200/0x4e0
>>>>       worker_thread+0x340/0x504
>>>>       kthread+0x138/0x140
>>>>       start_kernel_thread+0x14/0x18
>>> What is the "crash" here? use-after-free?
>> yes it's use-after-free kernel crash.
>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>> then invoked to handle the end of the keep-alive operation which
>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>> queue()) which could be potentially running simultaneously on another
>>>> cpu (as this is the controller shutdown code path) would forward
>>>> progress and deletes the admin queue. So, now from this point onward
>>>> we are not supposed to access the admin queue resources. However the
>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>> dispatch operation hasn't yet finished its work and so it could still
>>>> potentially access the admin queue resource while the admin queue had
>>>> been already deleted and that causes the above crash.
>>> Again, it is unclear to me what justifies the cure here. Every command can
>>> race with the shutdown, and it can be sync or async.
>>>
>> While I was researching on this issue, I figured that the nvme keep-alive is
>> probably the only async request running in worker thread context and not well
>> synchronized with the fabric controller shutdown code path. Yes we may have other
>> async request (for instance async read/write) which might get-in the way of the
>> fabric controller shutdown code path but those mostly run in the user context and
>> hence have a different code path than keep-alive command for submitting/queuing
>> the request.
> 
> This sounds like we papered around the problem. is it not allowed to call blk_execute_rq_nowait
> from a kthread?
I didn't mean that calling blk_execute_rq_nowait is not allowed from a
kthread. It's allowed, but the side effect it causes is what I wanted to
highlight.
> 
>>
>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>> synchronous operation so that we decrement admin->q_usage_counter only
>>>> after keep-alive command finished its execution and returns the command
>>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>>> keep-alive request finished execution and also keep-alive thread is not
>>>> running hw/hctx queue dispatch operation.
>>> I'd want to see exactly what is causing this race because we already flush
>>> the keep alive work when starting the shutdown... so it only means it is
>>> incorrectly firing again?
>>>
>>> What I think we should do instead, is given that we already cancel_sync the keep
>>> alive work before making progress with the teardown, we should just make sure
>>> that it doesn't triggered afterwards again if it does.
>> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
>> allocated but before it's queued,
> 
> The queue teardown quiesce the admin queue first, which should prevent the dispatch.
> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
> 
> The question is, why did blk_mq_quiesce_queue did not wait for rcu grace? which should
> have ensured that there is no dispatch going on? at least that is what it's supposed to do...
> something doesn't add up
> 
This issue manifests when the keep-alive command sneaks in after the admin
queue is unquiesced but before it is frozen.

nvme_unquiesce_admin_queue() <<< we unquiesce the admin queue

<<< here keep-alive sneaks in and starts running the dispatcher

nvme_remove_admin_tag_set()  <<< we freeze the queue and wait for ->q_usage_counter to become zero

So, as mentioned above, we have a time window between the queue unquiesce
and freeze operations where an async keep-alive could potentially sneak in
and cause the observed symptom.

Thanks,
--Nilay
  


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-22  6:54         ` Nilay Shroff
@ 2024-10-22  9:43           ` Sagi Grimberg
  2024-10-22 13:28             ` Nilay Shroff
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-22  9:43 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce




On 22/10/2024 9:54, Nilay Shroff wrote:
>
> On 10/21/24 20:44, Sagi Grimberg wrote:
>>
>>
>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>> Hey Nilay,
>>>>
>>>> First, can you please phrase the patch title to describe what it
>>>> is addressing, not what it is changing, that can come (as it does)
>>>> in the body.
>>>>
>>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>>
>>>> Second, I don't think we want to spend a thread waiting for
>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>> way that this can be done?
>>>>
>>>> Third, was this observed with non-loop transports?
>>>>
>>> Yes it could happen even with non-loop fabric transports.
>>>
>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>> This may lead to a race between the fabric controller admin queue
>>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>>> operation. This race could lead to the kernel crash shown below:
>>>> What is different about keep-alive than any other command?
>>> The one difference is, of course, keep-alive is async but we may have
>>> other async requests running while shutting down the controller. I found
>>> that, when compared the keep-alive with other commands, the keep-alive
>>> runs in the worker thread context but other commands (for instance async
>>> read/write) runs in the user context and that uses the blk plugging. When we
>>> use plugging, the call path for issuing request to driver is quite different
>>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>>> difference is probably the reason not causing the kernel crash issuing
>>> read/write command while shutting down the fabric controller.
>>>
>>> The call path for blk plugging for issuing the request to driver:
>>> __blk_flush_plug()
>>>     -> blk_mq_flush_plug_list()
>>>       ->blk_mq_dispatch_plug_list()
>>>
>>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>>> function) is protected from the queue being destroyed on other cpu thread.
>>> So that means that even after a request (assuming the only request in queue)
>>> is cancelled or flushed off due to fabric controller is shutting down, the
>>> blk-mq destroy code path would not be able to forward progress until the
>>> ->q_usage_counter is decremented and becomes zero. And we can see in
>>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>>> the blk_mq_run_hw_queue function returns.
>> Sounds like an issue that is not related to keep-alive at all.
>>
>>>>> Call Trace:
>>>>>        autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>        __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>        blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>        blk_mq_run_hw_queue+0x140/0x220
>>>>>        nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>        process_one_work+0x200/0x4e0
>>>>>        worker_thread+0x340/0x504
>>>>>        kthread+0x138/0x140
>>>>>        start_kernel_thread+0x14/0x18
>>>> What is the "crash" here? use-after-free?
>>> yes it's use-after-free kernel crash.
>>>>> While shutting down the fabric controller, if an nvme keep-alive request
>>>>> sneaks in then it is flushed off. The nvme_keep_alive_end_io function is
>>>>> then invoked to handle the end of the keep-alive operation; it decrements
>>>>> admin->q_usage_counter, and assuming this is the last/only request in the
>>>>> admin queue, admin->q_usage_counter becomes zero. If that happens, the
>>>>> blk-mq destroy queue operation (blk_mq_destroy_queue()), which could be
>>>>> running simultaneously on another cpu (as this is the controller shutdown
>>>>> code path), makes forward progress and deletes the admin queue. From this
>>>>> point onward we are not supposed to access the admin queue resources.
>>>>> However, the issue here is that the nvme keep-alive thread running the
>>>>> hw/hctx queue dispatch operation hasn't yet finished its work, so it could
>>>>> still access the admin queue resources after the admin queue has already
>>>>> been deleted, and that causes the above crash.
>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>> race with the shutdown, and it can be sync or async.
>>>>
>>> While researching this issue, I figured that nvme keep-alive is probably the
>>> only async request running in worker thread context that is not well
>>> synchronized with the fabric controller shutdown code path. Yes, we may have
>>> other async requests (for instance async read/write) which might get in the
>>> way of the fabric controller shutdown code path, but those mostly run in user
>>> context and hence have a different code path than the keep-alive command for
>>> submitting/queuing the request.
>> This sounds like we papered over the problem. Is it not allowed to call
>> blk_execute_rq_nowait from a kthread?
> I didn't mean that calling blk_execute_rq_nowait is not allowed from a kthread.
> It's allowed, but the side effect it causes is what I wanted to highlight.

The point is that any kthread using it would potentially trigger this bug.
Hence we should treat the disease and not the symptom.
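
For reference, the keep-alive submission being discussed has roughly this shape
(a simplified sketch of the pre-existing async path; error handling and a few
details are elided):

/*
 * Simplified sketch of nvme_keep_alive_work(): the request is fired with
 * blk_execute_rq_nowait() from workqueue context, with no plug and nothing
 * pinning ->q_usage_counter across the hw queue run, so the completion
 * (nvme_keep_alive_end_io) can drop the last queue reference while the
 * dispatch code is still on the CPU.
 */
static void keep_alive_work_sketch(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
					      struct nvme_ctrl, ka_work);
	struct request *rq;

	rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
				  BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(rq))
		return;
	nvme_init_request(rq, &ctrl->ka_cmd);

	rq->timeout = ctrl->kato * HZ;
	rq->end_io = nvme_keep_alive_end_io;	/* async completion callback */
	rq->end_io_data = ctrl;
	blk_execute_rq_nowait(rq, false);	/* returns before the command completes */
}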

>>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>>> synchronous operation, so that we decrement admin->q_usage_counter only
>>>>> after the keep-alive command has finished its execution and returned the
>>>>> command status back up to its caller (blk_execute_rq()). This ensures that
>>>>> the fabric shutdown code path doesn't destroy the fabric admin queue until
>>>>> the keep-alive request has finished execution and the keep-alive thread is
>>>>> no longer running the hw/hctx queue dispatch operation.
>>>> I'd want to see exactly what is causing this race because we already flush
>>>> the keep alive work when starting the shutdown... so it only means it is
>>>> incorrectly firing again?
>>>>
>>>> What I think we should do instead, given that we already cancel_sync the
>>>> keep-alive work before making progress with the teardown, is just make sure
>>>> that it doesn't get triggered again afterwards.
>>> No, keep-alive isn't incorrectly fired; however, once a keep-alive request is
>>> allocated but before it's queued,
>> The queue teardown quiesces the admin queue first, which should prevent the dispatch.
>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>
>> The question is, why did blk_mq_quiesce_queue not wait for the RCU grace period, which
>> should have ensured that there is no dispatch going on? At least that is what it's
>> supposed to do... something doesn't add up.
>>
> This issue manifests when the keep-alive command sneaks in after the admin queue is
> unquiesced but before it is frozen.
>
> nvme_unquiesce_admin_queue() <<< we unquiesce the admin queue
>
> <<< here keep-alive sneaks in and starts running the dispatcher
>
> nvme_remove_admin_tag_set()  <<< we freeze the queue and wait for ->q_usage_counter to become zero
>
> So, as mentioned above, we have a time window between the queue unquiesce and freeze
> operations where the async keep-alive could potentially sneak in, causing the
> observed symptom.
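
In code terms, the window described above sits between these two calls (a
simplified ordering sketch; the exact call sites vary by transport driver):

static void admin_teardown_sketch(struct nvme_ctrl *ctrl)
{
	nvme_unquiesce_admin_queue(ctrl);
	/*
	 * Window: a pending nvme_keep_alive_work() can now pass the
	 * "queue quiesced?" check and start dispatching on the admin queue.
	 */
	nvme_remove_admin_tag_set(ctrl);
	/*
	 * This freezes the queue and waits for ->q_usage_counter to reach
	 * zero; nothing pins the counter across the dispatch above, so the
	 * wait can finish while the dispatcher is still running.
	 */
}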

The problem here is that keep-alive should not be able to sneak in after
we unquiesce the admin queue.

What I'm seeing in the code is that the teardown flow does not call
nvme_stop_keep_alive() in the teardown path.

This seems to be caused by [1], so it makes sense that we can even trigger this race.
If nvme_stop_keep_alive() had been called from nvme_stop_ctrl(), which is called
way before we quiesce/unquiesce the queue, this issue would not have been encountered.

I think we could make an argument that both call-sites to nvme_stop_keep_alive()
should co-exist, with a little documentation as to why.
Can you try to reproduce with [2]?


[1]:
--
commit a54a93d0e3599b05856971734e15418ac551a14c
Author: Ming Lei <ming.lei@redhat.com>
Date:   Tue Aug 13 09:35:27 2024 +0800

     nvme: move stopping keep-alive into nvme_uninit_ctrl()

     Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
     moves starting keep-alive from nvme_start_ctrl() into
     nvme_init_ctrl_finish(), but don't move stopping keep-alive into
     nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
     after failing to start controller, finally use-after-free is triggered if
     nvme host driver is unloaded.

     This patch fixes kernel panic when running nvme/004 in case that connection
     failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().

     This way is reasonable because keep-alive is now started in
     nvme_init_ctrl_finish().

     Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
     Cc: Hannes Reinecke <hare@suse.de>
     Cc: Mark O'Donovan <shiftee@posteo.net>
     Reported-by: Changhui Zhong <czhong@redhat.com>
     Reviewed-by: Christoph Hellwig <hch@lst.de>
     Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
     Reviewed-by: Hannes Reinecke <hare@suse.de>
     Signed-off-by: Ming Lei <ming.lei@redhat.com>
     Signed-off-by: Keith Busch <kbusch@kernel.org>
--

This patch was introduced as a fix for:
3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")

[2]:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0dc8bcc664f2..275af23dda6d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
  {
         nvme_mpath_stop(ctrl);
         nvme_auth_stop(ctrl);
+       /*
+        * the transport driver may be terminating the admin tagset a little
+        * later on, so we cannot have the keep-alive work running
+        */
+       nvme_stop_keep_alive(ctrl);
         nvme_stop_failfast_work(ctrl);
         flush_work(&ctrl->async_event_work);
         cancel_work_sync(&ctrl->fw_act_work);
--


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-22  9:43           ` Sagi Grimberg
@ 2024-10-22 13:28             ` Nilay Shroff
  2024-10-23  9:40               ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2024-10-22 13:28 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce



On 10/22/24 15:13, Sagi Grimberg wrote:
> 
> 
> 
> On 22/10/2024 9:54, Nilay Shroff wrote:
>>
>> On 10/21/24 20:44, Sagi Grimberg wrote:
>>>
>>>
>>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>>> Hey Nilay,
>>>>>
>>>>> First, can you please phrase the patch title to describe what it
>>>>> is addressing, not what it is changing, that can come (as it does)
>>>>> in the body.
>>>>>
>>>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>>>
>>>>> Second, I don't think we want to spend a thread waiting for
>>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>>> way that this can be done?
>>>>>
>>>>> Third, was this observed with non-loop transports?
>>>>>
>>>> Yes it could happen even with non-loop fabric transports.
>>>>
>>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>>> This may lead to a race between the fabric controller admin queue
>>>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>>>> operation. This race could lead to the kernel crash shown below:
>>>>> What is different about keep-alive than any other command?
>>>> The one difference is, of course, keep-alive is async but we may have
>>>> other async requests running while shutting down the controller. I found
>>>> that, when compared the keep-alive with other commands, the keep-alive
>>>> runs in the worker thread context but other commands (for instance async
>>>> read/write) runs in the user context and that uses the blk plugging. When we
>>>> use plugging, the call path for issuing request to driver is quite different
>>>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>>>> difference is probably the reason not causing the kernel crash issuing
>>>> read/write command while shutting down the fabric controller.
>>>>
>>>> The call path for blk plugging for issuing the request to driver:
>>>> __blk_flush_plug()
>>>>     -> blk_mq_flush_plug_list()
>>>>       ->blk_mq_dispatch_plug_list()
>>>>
>>>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>>>> function) is protected from the queue being destroyed on other cpu thread.
>>>> So that means that even after a request (assuming the only request in queue)
>>>> is cancelled or flushed off due to fabric controller is shutting down, the
>>>> blk-mq destroy code path would not be able to forward progress until the
>>>> ->q_usage_counter is decremented and becomes zero. And we can see in
>>>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>>>> the blk_mq_run_hw_queue function returns.
>>> Sounds like an issue that is not related to keep-alive at all.
>>>
>>>>>> Call Trace:
>>>>>>        autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>>        __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>>        blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>>        blk_mq_run_hw_queue+0x140/0x220
>>>>>>        nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>>        process_one_work+0x200/0x4e0
>>>>>>        worker_thread+0x340/0x504
>>>>>>        kthread+0x138/0x140
>>>>>>        start_kernel_thread+0x14/0x18
>>>>> What is the "crash" here? use-after-free?
>>>> yes it's use-after-free kernel crash.
>>>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>>>> then invoked to handle the end of the keep-alive operation which
>>>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>>>> queue()) which could be potentially running simultaneously on another
>>>>>> cpu (as this is the controller shutdown code path) would forward
>>>>>> progress and deletes the admin queue. So, now from this point onward
>>>>>> we are not supposed to access the admin queue resources. However the
>>>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>>>> dispatch operation hasn't yet finished its work and so it could still
>>>>>> potentially access the admin queue resource while the admin queue had
>>>>>> been already deleted and that causes the above crash.
>>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>>> race with the shutdown, and it can be sync or async.
>>>>>
>>>> While I was researching on this issue, I figured that the nvme keep-alive is
>>>> probably the only async request running in worker thread context and not well
>>>> synchronized with the fabric controller shutdown code path. Yes we may have other
>>>> async request (for instance async read/write) which might get-in the way of the
>>>> fabric controller shutdown code path but those mostly run in the user context and
>>>> hence have a different code path than keep-alive command for submitting/queuing
>>>> the request.
>>> This sounds like we papered around the problem. is it not allowed to call blk_execute_rq_nowait
>>> from a kthread?
>> I didn't meant calling blk_execute_rq_nowait is not allowed from kthread. It's allowed
>> but the side effect it causes is what I just wanted to highlight.
> 
> The point is that any kthread using it would potentially trigger this bug. Hence we should
> treat the disease and not the symptom.
> 
>>>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>>>> synchronous operation so that we decrement admin->q_usage_counter only
>>>>>> after keep-alive command finished its execution and returns the command
>>>>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>>>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>>>>> keep-alive request finished execution and also keep-alive thread is not
>>>>>> running hw/hctx queue dispatch operation.
>>>>> I'd want to see exactly what is causing this race because we already flush
>>>>> the keep alive work when starting the shutdown... so it only means it is
>>>>> incorrectly firing again?
>>>>>
>>>>> What I think we should do instead, is given that we already cancel_sync the keep
>>>>> alive work before making progress with the teardown, we should just make sure
>>>>> that it doesn't triggered afterwards again if it does.
>>>> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
>>>> allocated but before it's queued,
>>> The queue teardown quiesce the admin queue first, which should prevent the dispatch.
>>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>>
>>> The question is, why did blk_mq_quiesce_queue did not wait for rcu grace? which should
>>> have ensured that there is no dispatch going on? at least that is what it's supposed to do...
>>> something doesn't add up
>>>
>> This issue manifests when the keep-alive command sneaks in after the admin queue is
>> unquiesced but before its freezed.
>>
>> nvme_unquiesce_admin_queue() <<< we unquiesce admin queue
>>
>> <<< here keep-alive sneaks in and it starts running dispatcher
>>
>> nvme_remove_admin_tag_set()  <<< we freeze queue and wait for ->q_usage_counter to become zero
>>
>> So as mentioned above, we have time window between queue unquiesce and freeze
>> operations where async keep-alive could potentially sneaks in and causing the
>> observed symptom.
> 
> The problem here is that keep-alive should not be able to sneak in after we unquiescence
> the admin queue.
> 
> What I'm seeing in the code is that the teardown flow does not call nvme_stop_keep_alive()
> in the teardown path.
> 
> This seems to be caused by [1], so it makes sense that we can even trigger this race.
> If nvme_stop_keep_alive would have been called from nvme_stop_ctrl() which is called
> way before we quiesce/unquiesce the queue, then this issue would not have been encountered.
> 
> I think we could make an argument that both call-sites to nvme_stop_keep_alive() should co-exist
> with a little documentation to why.
> Can you try and reproduce with [2]?
> 
> 
> [1]:
> -- 
> commit a54a93d0e3599b05856971734e15418ac551a14c
> Author: Ming Lei <ming.lei@redhat.com>
> Date:   Tue Aug 13 09:35:27 2024 +0800
> 
>     nvme: move stopping keep-alive into nvme_uninit_ctrl()
> 
>     Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
>     moves starting keep-alive from nvme_start_ctrl() into
>     nvme_init_ctrl_finish(), but don't move stopping keep-alive into
>     nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
>     after failing to start controller, finally use-after-free is triggered if
>     nvme host driver is unloaded.
> 
>     This patch fixes kernel panic when running nvme/004 in case that connection
>     failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().
> 
>     This way is reasonable because keep-alive is now started in
>     nvme_init_ctrl_finish().
> 
>     Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>     Cc: Hannes Reinecke <hare@suse.de>
>     Cc: Mark O'Donovan <shiftee@posteo.net>
>     Reported-by: Changhui Zhong <czhong@redhat.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>     Reviewed-by: Hannes Reinecke <hare@suse.de>
>     Signed-off-by: Ming Lei <ming.lei@redhat.com>
>     Signed-off-by: Keith Busch <kbusch@kernel.org>
> -- 
> 
> This patch was introduced as a fix for:
> 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
> 
> [2]:
> -- 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0dc8bcc664f2..275af23dda6d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>  {
>         nvme_mpath_stop(ctrl);
>         nvme_auth_stop(ctrl);
> +       /*
> +        * the transport driver may be terminating the admin tagset a little
> +        * later on, so we cannot have the keep-alive work running
> +        */
> +       nvme_stop_keep_alive(ctrl);
>         nvme_stop_failfast_work(ctrl);
>         flush_work(&ctrl->async_event_work);
>         cancel_work_sync(&ctrl->fw_act_work);
> -- 
> 
I tried the above patch [2] and ran nvme/037 in a loop for quite a long time and
confirmed that with this change I don't see the kernel crash. So I think we should
restore the above change.

Also please note that the earlier change making keep-alive a synchronous
operation has already been merged into the mainline/Linus' kernel tree. So should we
create a new patch reverting the changes merged in commit d06923670b5a ("nvme:
make keep-alive synchronous operation") and then add the above patch [2] on top
of it?

Moreover, I would adjust the new patch so that we can keep change 599d9f3a10ee
("nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function"). I think
the changes implemented in commit 599d9f3a10ee are a logical step forward: they
avoid taking ctrl->lock and instead use the helper nvme_ctrl_state for retrieving
the controller state.
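
For reference, the difference that 599d9f3a10ee keeps is roughly the following
before/after (a paraphrased sketch, not the exact nvme_keep_alive_finish() hunks;
the helper names here are made up for illustration, and nvme_ctrl_state() is
assumed to be the lockless state accessor from nvme.h):

/* Before: sample the controller state under ctrl->lock. */
static bool want_restart_ka_locked(struct nvme_ctrl *ctrl)
{
	unsigned long flags;
	bool startka;

	spin_lock_irqsave(&ctrl->lock, flags);
	startka = ctrl->state == NVME_CTRL_LIVE ||
		  ctrl->state == NVME_CTRL_CONNECTING;
	spin_unlock_irqrestore(&ctrl->lock, flags);
	return startka;
}

/* After: use the nvme_ctrl_state() helper instead of taking the lock. */
static bool want_restart_ka_lockless(struct nvme_ctrl *ctrl)
{
	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);

	return state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING;
}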

What do you suggest? 

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] nvme: make keep-alive synchronous operation
  2024-10-22 13:28             ` Nilay Shroff
@ 2024-10-23  9:40               ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2024-10-23  9:40 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce




On 22/10/2024 16:28, Nilay Shroff wrote:
>
> On 10/22/24 15:13, Sagi Grimberg wrote:
>>
>>
>> On 22/10/2024 9:54, Nilay Shroff wrote:
>>> On 10/21/24 20:44, Sagi Grimberg wrote:
>>>>
>>>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>>>> Hey Nilay,
>>>>>>
>>>>>> First, can you please phrase the patch title to describe what it
>>>>>> is addressing, not what it is changing, that can come (as it does)
>>>>>> in the body.
>>>>>>
>>>>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>>>>
>>>>>> Second, I don't think we want to spend a thread waiting for
>>>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>>>> way that this can be done?
>>>>>>
>>>>>> Third, was this observed with non-loop transports?
>>>>>>
>>>>> Yes it could happen even with non-loop fabric transports.
>>>>>
>>>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>>>> This may lead to a race between the fabric controller admin queue
>>>>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>>>>> operation. This race could lead to the kernel crash shown below:
>>>>>> What is different about keep-alive than any other command?
>>>>> The one difference is, of course, keep-alive is async but we may have
>>>>> other async requests running while shutting down the controller. I found
>>>>> that, when compared the keep-alive with other commands, the keep-alive
>>>>> runs in the worker thread context but other commands (for instance async
>>>>> read/write) runs in the user context and that uses the blk plugging. When we
>>>>> use plugging, the call path for issuing request to driver is quite different
>>>>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>>>>> difference is probably the reason not causing the kernel crash issuing
>>>>> read/write command while shutting down the fabric controller.
>>>>>
>>>>> The call path for blk plugging for issuing the request to driver:
>>>>> __blk_flush_plug()
>>>>>      -> blk_mq_flush_plug_list()
>>>>>        ->blk_mq_dispatch_plug_list()
>>>>>
>>>>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>>>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>>>>> function) is protected from the queue being destroyed on other cpu thread.
>>>>> So that means that even after a request (assuming the only request in queue)
>>>>> is cancelled or flushed off due to fabric controller is shutting down, the
>>>>> blk-mq destroy code path would not be able to forward progress until the
>>>>> ->q_usage_counter is decremented and becomes zero. And we can see in
>>>>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>>>>> the blk_mq_run_hw_queue function returns.
>>>> Sounds like an issue that is not related to keep-alive at all.
>>>>
>>>>>>> Call Trace:
>>>>>>>         autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>>>         __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>>>         blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>>>         blk_mq_run_hw_queue+0x140/0x220
>>>>>>>         nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>>>         process_one_work+0x200/0x4e0
>>>>>>>         worker_thread+0x340/0x504
>>>>>>>         kthread+0x138/0x140
>>>>>>>         start_kernel_thread+0x14/0x18
>>>>>> What is the "crash" here? use-after-free?
>>>>> yes it's use-after-free kernel crash.
>>>>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>>>>> then invoked to handle the end of the keep-alive operation which
>>>>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>>>>> queue()) which could be potentially running simultaneously on another
>>>>>>> cpu (as this is the controller shutdown code path) would forward
>>>>>>> progress and deletes the admin queue. So, now from this point onward
>>>>>>> we are not supposed to access the admin queue resources. However the
>>>>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>>>>> dispatch operation hasn't yet finished its work and so it could still
>>>>>>> potentially access the admin queue resource while the admin queue had
>>>>>>> been already deleted and that causes the above crash.
>>>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>>>> race with the shutdown, and it can be sync or async.
>>>>>>
>>>>> While I was researching on this issue, I figured that the nvme keep-alive is
>>>>> probably the only async request running in worker thread context and not well
>>>>> synchronized with the fabric controller shutdown code path. Yes we may have other
>>>>> async request (for instance async read/write) which might get-in the way of the
>>>>> fabric controller shutdown code path but those mostly run in the user context and
>>>>> hence have a different code path than keep-alive command for submitting/queuing
>>>>> the request.
>>>> This sounds like we papered around the problem. is it not allowed to call blk_execute_rq_nowait
>>>> from a kthread?
>>> I didn't meant calling blk_execute_rq_nowait is not allowed from kthread. It's allowed
>>> but the side effect it causes is what I just wanted to highlight.
>> The point is that any kthread using it would potentially trigger this bug. Hence we should
>> treat the disease and not the symptom.
>>
>>>>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>>>>> synchronous operation so that we decrement admin->q_usage_counter only
>>>>>>> after keep-alive command finished its execution and returns the command
>>>>>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>>>>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>>>>>> keep-alive request finished execution and also keep-alive thread is not
>>>>>>> running hw/hctx queue dispatch operation.
>>>>>> I'd want to see exactly what is causing this race because we already flush
>>>>>> the keep alive work when starting the shutdown... so it only means it is
>>>>>> incorrectly firing again?
>>>>>>
>>>>>> What I think we should do instead, is given that we already cancel_sync the keep
>>>>>> alive work before making progress with the teardown, we should just make sure
>>>>>> that it doesn't triggered afterwards again if it does.
>>>>> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
>>>>> allocated but before it's queued,
>>>> The queue teardown quiesce the admin queue first, which should prevent the dispatch.
>>>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>>>
>>>> The question is, why did blk_mq_quiesce_queue did not wait for rcu grace? which should
>>>> have ensured that there is no dispatch going on? at least that is what it's supposed to do...
>>>> something doesn't add up
>>>>
>>> This issue manifests when the keep-alive command sneaks in after the admin queue is
>>> unquiesced but before its freezed.
>>>
>>> nvme_unquiesce_admin_queue() <<< we unquiesce admin queue
>>>
>>> <<< here keep-alive sneaks in and it starts running dispatcher
>>>
>>> nvme_remove_admin_tag_set()  <<< we freeze queue and wait for ->q_usage_counter to become zero
>>>
>>> So as mentioned above, we have time window between queue unquiesce and freeze
>>> operations where async keep-alive could potentially sneaks in and causing the
>>> observed symptom.
>> The problem here is that keep-alive should not be able to sneak in after we unquiescence
>> the admin queue.
>>
>> What I'm seeing in the code is that the teardown flow does not call nvme_stop_keep_alive()
>> in the teardown path.
>>
>> This seems to be caused by [1], so it makes sense that we can even trigger this race.
>> If nvme_stop_keep_alive would have been called from nvme_stop_ctrl() which is called
>> way before we quiesce/unquiesce the queue, then this issue would not have been encountered.
>>
>> I think we could make an argument that both call-sites to nvme_stop_keep_alive() should co-exist
>> with a little documentation to why.
>> Can you try and reproduce with [2]?
>>
>>
>> [1]:
>> -- 
>> commit a54a93d0e3599b05856971734e15418ac551a14c
>> Author: Ming Lei <ming.lei@redhat.com>
>> Date:   Tue Aug 13 09:35:27 2024 +0800
>>
>>      nvme: move stopping keep-alive into nvme_uninit_ctrl()
>>
>>      Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
>>      moves starting keep-alive from nvme_start_ctrl() into
>>      nvme_init_ctrl_finish(), but don't move stopping keep-alive into
>>      nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
>>      after failing to start controller, finally use-after-free is triggered if
>>      nvme host driver is unloaded.
>>
>>      This patch fixes kernel panic when running nvme/004 in case that connection
>>      failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().
>>
>>      This way is reasonable because keep-alive is now started in
>>      nvme_init_ctrl_finish().
>>
>>      Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>>      Cc: Hannes Reinecke <hare@suse.de>
>>      Cc: Mark O'Donovan <shiftee@posteo.net>
>>      Reported-by: Changhui Zhong <czhong@redhat.com>
>>      Reviewed-by: Christoph Hellwig <hch@lst.de>
>>      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>>      Reviewed-by: Hannes Reinecke <hare@suse.de>
>>      Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>      Signed-off-by: Keith Busch <kbusch@kernel.org>
>> -- 
>>
>> This patch was introduced as a fix for:
>> 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>>
>> [2]:
>> -- 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 0dc8bcc664f2..275af23dda6d 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>>   {
>>          nvme_mpath_stop(ctrl);
>>          nvme_auth_stop(ctrl);
>> +       /*
>> +        * the transport driver may be terminating the admin tagset a little
>> +        * later on, so we cannot have the keep-alive work running
>> +        */
>> +       nvme_stop_keep_alive(ctrl);
>>          nvme_stop_failfast_work(ctrl);
>>          flush_work(&ctrl->async_event_work);
>>          cancel_work_sync(&ctrl->fw_act_work);
>> -- 
>>
> I tried the above patch[2] and ran nvme/037 in a loop for quite a long time and
> confirmed that with this change I don't see the kernel crash. So I think we shall
> restore the above change.

Yes, and we need a "Fixes:" tag because it solves a regression caused by [1]

>
> Also please note that the earlier changes where we made keep-alive synchronous
> operation has been already merged to the mainline/Linus' kernel tree. So should we
> create a new patch reverting the changes merged in commit d06923670b5a ("nvme:
> make keep-alive synchronous operation") and then on top of it add the above
> patch[2]?

Yes, one revert patch and one patch for [2]

>
> Moreover, I would adjust the new patch so that we can keep this change 599d9f3a10ee
> ("nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function"). I think
> changes implemented in commit 599d9f3a10ee is a logical step forward to avoid using
> ctrl->lock and instead use helper nvme_ctrl_state for retrieving the controller state.

That is fine to keep.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-10-23 10:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-16  3:03 [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
2024-10-16  3:03 ` [PATCH v4 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
2024-10-20 21:37   ` Sagi Grimberg
2024-10-16  3:03 ` [PATCH v4 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
2024-10-16  7:59   ` Christoph Hellwig
2024-10-20 21:49   ` Sagi Grimberg
2024-10-21 11:58     ` Nilay Shroff
2024-10-21 15:14       ` Sagi Grimberg
2024-10-22  6:54         ` Nilay Shroff
2024-10-22  9:43           ` Sagi Grimberg
2024-10-22 13:28             ` Nilay Shroff
2024-10-23  9:40               ` Sagi Grimberg
2024-10-16  3:03 ` [PATCH v4 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
2024-10-17 18:08 ` [PATCH v4 0/3] nvme: system fault while shutting down fabric controller Keith Busch
2024-10-20 21:51   ` Sagi Grimberg
2024-10-21 12:03     ` Nilay Shroff
