All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] nvme: system fault while shutting down fabric controller
@ 2024-10-04 11:46 Nilay Shroff
  2024-10-04 11:46 ` [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
  2024-10-04 11:46 ` [PATCH 2/2] nvme: make keep-alive synchronous operation Nilay Shroff
  0 siblings, 2 replies; 6+ messages in thread
From: Nilay Shroff @ 2024-10-04 11:46 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, chaitanyak, gjoyce, Nilay Shroff

We observed a kernel task hang up and a kernel crash while shutting down 
NVMe fabric controller. These issues were observed while running blktest
nvme/037. And this patch series address issues encountered while running 
this test.

We intermittently observe the blow kernel task hang while running the 
blktest nvme/037. This test setup NVMeOF passthru controller using loop 
target, connect to it and then immediately terminate/cleanup the 
connection. 

dmesg output:
-------------
run blktests nvme/037 at 2024-10-04 00:46:02
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme1: D3 entry latency set to 10 seconds
nvme nvme1: creating 32 I/O queues.
nvme nvme1: new ctrl: "blktests-subsystem-1"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: resetting controller
INFO: task nvme:3082 blocked for more than 120 seconds.
      Not tainted 6.11.0+ #89
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D stack:0     pid:3082  tgid:3082  ppid:2983   flags:0x00042080
Call Trace:
    0xc000000070f5bf90 (unreliable)
    __switch_to+0x148/0x230
    __schedule+0x260/0x6dc
    schedule+0x40/0x100
    blk_mq_freeze_queue_wait+0xa4/0xec
    blk_mq_destroy_queue+0x68/0xac
    nvme_remove_admin_tag_set+0x2c/0xb8 [nvme_core]
    nvme_loop_destroy_admin_queue+0x68/0x88 [nvme_loop]
    nvme_do_delete_ctrl+0x1e0/0x268 [nvme_core]
    nvme_delete_ctrl_sync+0xd4/0x104 [nvme_core]
    nvme_sysfs_delete+0x78/0x90 [nvme_core]
    dev_attr_store+0x34/0x50
    sysfs_kf_write+0x64/0x78
    kernfs_fop_write_iter+0x1b0/0x290
    vfs_write+0x3bc/0x4f8
    ksys_write+0x84/0x140
    system_call_exception+0x124/0x320
    system_call_vectored_common+0x15c/0x2ec

As we can see from the above trace that nvme task hangs up indefinitely
while shutting down loop controller. This task couldn't forward progress 
because it's waiting for unfinished/outstanding requests which haven't 
yet finished. 

The first patch in the series fixes the above hang by ensuring that while
shutting down nvme loop controller, we flush off any pending I/O to the 
completion, which might have been queued after that queue has been quiesced. 
So the first patch adds a missing unquiesce admin and IO queue operation in 
the nvme loop driver just before the respective queue is destroyed.

The second patch in the series fixes another issue with the nvme keep-alive 
operation. The keep-alive operation could potentially sneak in while 
the fabric controller is shutting down. We encounter the below intermittent 
kernel crash while running blktest nvme/037:

dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP:  c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

The above crash occurred while shutting down fabric/loop controller.

While shutting down fabric controller, if nvme keep-alive request sneaks in 
and later flushed off then nvme_keep_alive_end_io() function is 
asynchronously invoked to handle the end of the keep-alive operation. The 
nvme_keep_alive_end_io() decrements the admin-queue-usage-ref-counter and 
assuming this is the last/only request in the admin queue then the admin 
queue-usage-ref-counter becomes zero. If that happens then blk-mq destroy 
queue operation (blk_mq_destroy_queue()) which could be potentially 
running simultaneously on another cpu (as this is the controller shutdown 
code path) would forward progress and deletes the admin queue. However at 
the same time nvme keep-alive thread running on another cpu hasn't yet 
returned/finished from it's async blk-mq request operation (i.e blk_execute_
rq_nowait()) and so it could still access admin queue resources which could 
have been already released from controller shutdown code path and that 
causes the observed symptom. 
For instance, find below the sequence of operations running simultaneously 
on two cpus and causing this issue:

cpu0:
nvme_keep_alive_work()
  ->blk_execute_rq_no_wait()
    ->blk_mq_run_hw_queue()
      ->blk_mq_sched_dispatch_requests()
        ->__blk_mq_sched_dispatch_requests()
          ->blk_mq_dispatch_rq_list()
            ->nvme_loop_queue_rq()
              ->nvme_fail_nonready_command() -- here keep-alive req fails because admin queue is shutting down
                ->nvme_complete_rq()
                  ->nvme_end_req()
                    ->blk_mq_end_request()
                      ->__blk_mq_end_request()
                        ->nvme_keep_alive_end_io() -- here we decrement admin-queue-usage-ref-counter
cpu1:
nvme_loop_delete_ctrl_host()
  ->nvme_loop_shutdown_ctrl()
    ->nvme_loop_destroy_admin_queue()
      ->nvme_remove_admin_tag_set()
        ->blk_mq_destroy_queue() -- here we wait until admin-queue-usage-ref-counter reches to zero 
        ->blk_put_queue() -- here we destroy queue once admin-queue-usage-ref-counter becomes zero
                          -- From here on we are not supposed to access admin queue
                             resources, however, nvme keep-alive thread running on 
                             cpu0 which is not yet finished and so may still access 
                             the admin qeueue pointer and causing the observed crash.

So prima facie, from the above trace it appears that the nvme keep-alive 
thread running on one cpu races with the shutdown controller operation 
which could be running on another cpu.

The second patch in the series addresses above issue by making nvme keep-
alive a synchronous operation so that we decrement admin-queue-usage-ref- 
counter only after keep-alive operation/command finish and returns status.
This would also ensure that blk_mq_destroy_queue() doesn't return until
the nvme keep-alive thread finish it's work and so it's safe to destroy the
queue.

Moreover, the Keep-alive command is lightweight and low-frequency, making 
it a synchronous approach shall be reasonable from a performance 
perspective. Since this command is not frequent compared to other NVMe 
operations (like reads/writes), it does not introduce a significant 
performance overhead when handled synchronously.

Nilay Shroff (2):
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme: make keep-alive synchronous operation

 drivers/nvme/host/core.c   | 25 ++++++++++---------------
 drivers/nvme/target/loop.c | 13 +++++++++++++
 2 files changed, 23 insertions(+), 15 deletions(-)

-- 
2.45.2



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller
  2024-10-04 11:46 [PATCH 0/2] nvme: system fault while shutting down fabric controller Nilay Shroff
@ 2024-10-04 11:46 ` Nilay Shroff
  2024-10-07  6:37   ` Christoph Hellwig
  2024-10-04 11:46 ` [PATCH 2/2] nvme: make keep-alive synchronous operation Nilay Shroff
  1 sibling, 1 reply; 6+ messages in thread
From: Nilay Shroff @ 2024-10-04 11:46 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, chaitanyak, gjoyce, Nilay Shroff

While shutting down loop controller, we first quiesce the admin/IO queue,
delete the admin/IO tag-set and then at last destroy the admin/IO queue.
However it's quite possible that during the window between quiescing and
destroying of the admin/IO queue, some admin/IO request might sneak in
and if that happens then we could potentially encounter a hung task
because shutdown operation can't forward progress until any pending I/O
is flushed off.

This commit helps ensure that before destroying the admin/IO queue, we
unquiesce the admin/IO queue so that any outstanding requests, which are
added after the admin/IO queue is quiesced, are now flushed to its
completion.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/target/loop.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e32790d8fc26..a9d112d34d4f 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -265,6 +265,13 @@ static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
 {
 	if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
 		return;
+	/*
+	 * It's possible that some requests might have been added
+	 * after admin queue is stopped/quiesced. So now start the
+	 * queue to flush these requests to the completion.
+	 */
+	nvme_unquiesce_admin_queue(&ctrl->ctrl);
+
 	nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
 	nvme_remove_admin_tag_set(&ctrl->ctrl);
 }
@@ -297,6 +304,12 @@ static void nvme_loop_destroy_io_queues(struct nvme_loop_ctrl *ctrl)
 		nvmet_sq_destroy(&ctrl->queues[i].nvme_sq);
 	}
 	ctrl->ctrl.queue_count = 1;
+	/*
+	 * It's possible that some requests might have been added
+	 * after io queue is stopped/quiesced. So now start the
+	 * queue to flush these requests to the completion.
+	 */
+	nvme_unquiesce_io_queues(&ctrl->ctrl);
 }
 
 static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] nvme: make keep-alive synchronous operation
  2024-10-04 11:46 [PATCH 0/2] nvme: system fault while shutting down fabric controller Nilay Shroff
  2024-10-04 11:46 ` [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
@ 2024-10-04 11:46 ` Nilay Shroff
  2024-10-07  6:41   ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Nilay Shroff @ 2024-10-04 11:46 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, chaitanyak, gjoyce, Nilay Shroff

The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (while shutting down controller) and the blk-mq
hw/hctx queuing from the keep-alive thread.

This fix helps avoid race by implementing keep-alive as a synchronous
operation so that admin queue-usage ref counter is decremented only
after keep-alive command finish execution and returns its status. This
would ensure that we don't inadvertently destroy the fabric admin queue
until we finish processing of nvme keep-alive request and its status and
hence it's safe to delete the queue.

Also, while we are at it, instead of first acquiring ctrl lock and then
accessing NVMe controller state, lets use the helper function
nvme_ctrl_state() in nvme_keep_alive_end_io() and get rid of the
lock.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 02897f0564a3..5a690cf16e5e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1292,14 +1292,14 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
-static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-						 blk_status_t status)
+static void nvme_keep_alive_finish(struct request *rq,
+				blk_status_t status,
+				struct nvme_ctrl *ctrl)
 {
-	struct nvme_ctrl *ctrl = rq->end_io_data;
-	unsigned long flags;
 	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
 	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
 
 	/*
 	 * Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1313,25 +1313,19 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		delay = 0;
 	}
 
-	blk_mq_free_request(rq);
-
 	if (status) {
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 				status);
-		return RQ_END_IO_NONE;
+		return;
 	}
 
 	ctrl->ka_last_check_time = jiffies;
 	ctrl->comp_seen = false;
-	spin_lock_irqsave(&ctrl->lock, flags);
-	if (ctrl->state == NVME_CTRL_LIVE ||
-	    ctrl->state == NVME_CTRL_CONNECTING)
+	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
 		startka = true;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
-	return RQ_END_IO_NONE;
 }
 
 static void nvme_keep_alive_work(struct work_struct *work)
@@ -1340,6 +1334,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
 			struct nvme_ctrl, ka_work);
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
+	blk_status_t status;
 
 	ctrl->ka_last_check_time = jiffies;
 
@@ -1362,9 +1357,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	nvme_init_request(rq, &ctrl->ka_cmd);
 
 	rq->timeout = ctrl->kato * HZ;
-	rq->end_io = nvme_keep_alive_end_io;
-	rq->end_io_data = ctrl;
-	blk_execute_rq_nowait(rq, false);
+	status = blk_execute_rq(rq, false);
+	nvme_keep_alive_finish(rq, status, ctrl);
+	blk_mq_free_request(rq);
 }
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller
  2024-10-04 11:46 ` [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
@ 2024-10-07  6:37   ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2024-10-07  6:37 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, gjoyce

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] nvme: make keep-alive synchronous operation
  2024-10-04 11:46 ` [PATCH 2/2] nvme: make keep-alive synchronous operation Nilay Shroff
@ 2024-10-07  6:41   ` Christoph Hellwig
  2024-10-07  7:55     ` Nilay Shroff
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2024-10-07  6:41 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, gjoyce

On Fri, Oct 04, 2024 at 05:16:57PM +0530, Nilay Shroff wrote:
> The nvme keep-alive operation, which executes at a periodic interval,
> could potentially sneak in while shutting down a fabric controller.
> This may lead to a race between the fabric controller admin queue
> destroy code path (while shutting down controller) and the blk-mq
> hw/hctx queuing from the keep-alive thread.
> 
> This fix helps avoid race by implementing keep-alive as a synchronous
> operation so that admin queue-usage ref counter is decremented only
> after keep-alive command finish execution and returns its status.

With that you mean ->q_usage_counter?

Moving to synchronous submission and wasting a workqueue context for
that is a bit sad.  I think just removing the blk_mq_free_request call
from nvme_keep_alive_finish and returning RQ_END_IO_FREE instead
should have the same effect, or am I missing something?

> Also, while we are at it, instead of first acquiring ctrl lock and then
> accessing NVMe controller state, lets use the helper function
> nvme_ctrl_state() in nvme_keep_alive_end_io() and get rid of the
> lock.

Please split that into a separate patch.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] nvme: make keep-alive synchronous operation
  2024-10-07  6:41   ` Christoph Hellwig
@ 2024-10-07  7:55     ` Nilay Shroff
  0 siblings, 0 replies; 6+ messages in thread
From: Nilay Shroff @ 2024-10-07  7:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, kbusch, sagi, axboe, chaitanyak, gjoyce



On 10/7/24 12:11, Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 05:16:57PM +0530, Nilay Shroff wrote:
>> The nvme keep-alive operation, which executes at a periodic interval,
>> could potentially sneak in while shutting down a fabric controller.
>> This may lead to a race between the fabric controller admin queue
>> destroy code path (while shutting down controller) and the blk-mq
>> hw/hctx queuing from the keep-alive thread.
>>
>> This fix helps avoid race by implementing keep-alive as a synchronous
>> operation so that admin queue-usage ref counter is decremented only
>> after keep-alive command finish execution and returns its status.
> 
> With that you mean ->q_usage_counter?
Yes I meant ->q_usage_counter.
> 
> Moving to synchronous submission and wasting a workqueue context for
> that is a bit sad.  I think just removing the blk_mq_free_request call
> from nvme_keep_alive_finish and returning RQ_END_IO_FREE instead
> should have the same effect, or am I missing something?
> 
Unfortunately, that would not help because we still fall through the same 
code path and here we're just moving blk_mq_free_request call from 
nvme_keep_alive_finish to its caller. For instance, assuming we keep 
nvme_keep_alive work asynchronous then the code path for invoking 
nvme_keep_alive_finish would be,

nvme_keep_alive_work()
  ->blk_execute_rq_no_wait()
    ->blk_mq_run_hw_queue()
      ->blk_mq_sched_dispatch_requests()
        ->__blk_mq_sched_dispatch_requests()
          ->blk_mq_dispatch_rq_list()
            ->nvme_loop_queue_rq()
              ->nvme_fail_nonready_command() 
                ->nvme_complete_rq()
                  ->nvme_end_req()
                    ->blk_mq_end_request()
                      ->__blk_mq_end_request()
                        ->nvme_keep_alive_finish() 

In the above call path, if nvme_keep_alive_finish returns RQ_END_IO_FREE (and we 
remove blk_mq_free_request call from it) then the caller __blk_mq_end_request would 
then invoke blk_mq_free_request and we may still hit the same bug because 
this would allow blk_mq_destroy_queue() (which may be running on another cpu from 
the controller shutdown code path) forward progress and next from blk_put_queue()
the admin queue resources are deleted.

>> Also, while we are at it, instead of first acquiring ctrl lock and then
>> accessing NVMe controller state, lets use the helper function
>> nvme_ctrl_state() in nvme_keep_alive_end_io() and get rid of the
>> lock.
> 
> Please split that into a separate patch.
Sure I would split the patch and resend the series. 

And thank you for your review comments!

--Nilay


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-10-07  8:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-04 11:46 [PATCH 0/2] nvme: system fault while shutting down fabric controller Nilay Shroff
2024-10-04 11:46 ` [PATCH 1/2] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
2024-10-07  6:37   ` Christoph Hellwig
2024-10-04 11:46 ` [PATCH 2/2] nvme: make keep-alive synchronous operation Nilay Shroff
2024-10-07  6:41   ` Christoph Hellwig
2024-10-07  7:55     ` Nilay Shroff

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.