* [PATCH v3 1/3] nvme-loop: flush off pending I/O while shutting down loop controller
2024-10-08 6:21 [PATCH v3 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
@ 2024-10-08 6:21 ` Nilay Shroff
2024-10-08 6:21 ` [PATCH v3 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
2024-10-08 6:21 ` [PATCH v3 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
2 siblings, 0 replies; 7+ messages in thread
From: Nilay Shroff @ 2024-10-08 6:21 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
While shutting down the loop controller, we first quiesce the admin/IO
queue, delete the admin/IO tag-set and finally destroy the admin/IO
queue.

However, it is quite possible that in the window between quiescing and
destroying the admin/IO queue, an admin/IO request sneaks in. If that
happens, we could hit a hung task because the shutdown operation cannot
make forward progress until all pending I/O is flushed off.

This commit ensures that before destroying the admin/IO queue, we
unquiesce it so that any outstanding requests, added after the queue was
quiesced, are flushed to completion.
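To illustrate, here is a condensed sketch of the teardown ordering this
patch establishes for the admin queue (mirroring the diff below; the
NVME_LOOP_Q_LIVE check is omitted):

	/*
	 * The queue was quiesced earlier in the shutdown path. Unquiescing
	 * it here lets any requests that sneaked in while it was quiesced
	 * run to completion before the queue and tag set are torn down.
	 */
	nvme_unquiesce_admin_queue(&ctrl->ctrl);
	nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
	nvme_remove_admin_tag_set(&ctrl->ctrl);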
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/target/loop.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e32790d8fc26..a9d112d34d4f 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -265,6 +265,13 @@ static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
{
if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
return;
+ /*
+ * It's possible that some requests might have been added
+ * after the admin queue was stopped/quiesced. So now start the
+ * queue to flush these requests to completion.
+ */
+ nvme_unquiesce_admin_queue(&ctrl->ctrl);
+
nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
nvme_remove_admin_tag_set(&ctrl->ctrl);
}
@@ -297,6 +304,12 @@ static void nvme_loop_destroy_io_queues(struct nvme_loop_ctrl *ctrl)
nvmet_sq_destroy(&ctrl->queues[i].nvme_sq);
}
ctrl->ctrl.queue_count = 1;
+ /*
+ * It's possible that some requests might have been added
+ * after the IO queue was stopped/quiesced. So now start the
+ * queue to flush these requests to completion.
+ */
+ nvme_unquiesce_io_queues(&ctrl->ctrl);
}
static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl)
--
2.45.2
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v3 2/3] nvme: make keep-alive synchronous operation
2024-10-08 6:21 [PATCH v3 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
2024-10-08 6:21 ` [PATCH v3 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
@ 2024-10-08 6:21 ` Nilay Shroff
2024-10-08 7:04 ` Christoph Hellwig
2024-10-08 6:21 ` [PATCH v3 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
2 siblings, 1 reply; 7+ messages in thread
From: Nilay Shroff @ 2024-10-08 6:21 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (run while shutting down the controller) and the
blk-mq hw/hctx queuing from the keep-alive thread.

This fix avoids the race by implementing keep-alive as a synchronous
operation, so that the admin queue-usage ref counter is decremented only
after the keep-alive command finishes execution and returns its status.
This ensures that we don't inadvertently destroy the fabric admin queue
until we have finished processing the nvme keep-alive request and its
status, and hence it is safe to delete the queue.
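In short, the keep-alive submission path changes roughly as follows
(the actual change is in the diff below):

	/*
	 * Before: fire-and-forget. The request is completed and freed from
	 * the end_io path, so the queue reference can be dropped while the
	 * dispatch code that issued the keep-alive is still running.
	 */
	rq->end_io = nvme_keep_alive_end_io;
	rq->end_io_data = ctrl;
	blk_execute_rq_nowait(rq, false);

	/*
	 * After: wait for completion. The queue reference is dropped only
	 * once the keep-alive has finished and its status has been
	 * processed, so the admin queue cannot be torn down underneath us.
	 */
	status = blk_execute_rq(rq, false);
	nvme_keep_alive_finish(rq, status, ctrl);
	blk_mq_free_request(rq);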
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 02897f0564a3..736adbf65ef5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1292,10 +1292,10 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
}
-static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
- blk_status_t status)
+static void nvme_keep_alive_finish(struct request *rq,
+ blk_status_t status,
+ struct nvme_ctrl *ctrl)
{
- struct nvme_ctrl *ctrl = rq->end_io_data;
unsigned long flags;
bool startka = false;
unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
@@ -1313,13 +1313,11 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
delay = 0;
}
- blk_mq_free_request(rq);
-
if (status) {
dev_err(ctrl->device,
"failed nvme_keep_alive_end_io error=%d\n",
status);
- return RQ_END_IO_NONE;
+ return;
}
ctrl->ka_last_check_time = jiffies;
@@ -1331,7 +1329,6 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
spin_unlock_irqrestore(&ctrl->lock, flags);
if (startka)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
- return RQ_END_IO_NONE;
}
static void nvme_keep_alive_work(struct work_struct *work)
@@ -1340,6 +1337,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
struct nvme_ctrl, ka_work);
bool comp_seen = ctrl->comp_seen;
struct request *rq;
+ blk_status_t status;
ctrl->ka_last_check_time = jiffies;
@@ -1362,9 +1360,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
nvme_init_request(rq, &ctrl->ka_cmd);
rq->timeout = ctrl->kato * HZ;
- rq->end_io = nvme_keep_alive_end_io;
- rq->end_io_data = ctrl;
- blk_execute_rq_nowait(rq, false);
+ status = blk_execute_rq(rq, false);
+ nvme_keep_alive_finish(rq, status, ctrl);
+ blk_mq_free_request(rq);
}
static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
--
2.45.2
^ permalink raw reply related [flat|nested] 7+ messages in thread

* Re: [PATCH v3 2/3] nvme: make keep-alive synchronous operation
2024-10-08 6:21 ` [PATCH v3 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
@ 2024-10-08 7:04 ` Christoph Hellwig
2024-10-08 17:46 ` Nilay Shroff
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2024-10-08 7:04 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Tue, Oct 08, 2024 at 11:51:51AM +0530, Nilay Shroff wrote:
> This fix avoids the race by implementing keep-alive as a synchronous
> operation, so that the admin queue-usage ref counter is decremented only
Please spell out q_usage_counter as requested in the first round.
> after the keep-alive command finishes execution and returns its status.
> This ensures that we don't inadvertently destroy the fabric admin queue
> until we have finished processing the nvme keep-alive request and its
> status, and hence it is safe to delete the queue.
I still fail to see why this requires a synchronous operation vs just
calling blk_mq_free_request, and thus decrementing q_usage_counter,
after checking the controller state.
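I.e., something along these lines (sketch only, untested):

static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
		blk_status_t status)
{
	struct nvme_ctrl *ctrl = rq->end_io_data;

	/* ... existing status handling and controller state check ... */

	/*
	 * Don't free the request here; returning RQ_END_IO_FREE lets the
	 * block layer call blk_mq_free_request() (and thus drop
	 * q_usage_counter) after this handler has looked at the state.
	 */
	return RQ_END_IO_FREE;
}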
Maybe I'm just dumb and missing the obvious even after the last
explanation, but then the commit log needs to be improved to explain
it.
> -static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> - blk_status_t status)
> +static void nvme_keep_alive_finish(struct request *rq,
> + blk_status_t status,
> + struct nvme_ctrl *ctrl)
And as a nitpick, this should be:
static void nvme_keep_alive_finish(struct request *rq, blk_status_t status,
struct nvme_ctrl *ctrl)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 2/3] nvme: make keep-alive synchronous operation
2024-10-08 7:04 ` Christoph Hellwig
@ 2024-10-08 17:46 ` Nilay Shroff
0 siblings, 0 replies; 7+ messages in thread
From: Nilay Shroff @ 2024-10-08 17:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, kbusch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/8/24 12:34, Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 11:51:51AM +0530, Nilay Shroff wrote:
>> This fix avoids the race by implementing keep-alive as a synchronous
>> operation, so that the admin queue-usage ref counter is decremented only
>
> Please spell out q_usage_counter as requested in the first round.
>
Yes sure, I will do it in the next patch revision.
>> after the keep-alive command finishes execution and returns its status.
>> This ensures that we don't inadvertently destroy the fabric admin queue
>> until we have finished processing the nvme keep-alive request and its
>> status, and hence it is safe to delete the queue.
>
> I still fail to see why this requires a synchronous operation vs just
> calling blk_mq_free_request and thus decrementing q_usage_counter
> afrer checking the controller state.
>
> Maybe I'm just dumb and missing the obvious even after the last
> explanation, but then the commit log needs to be improved to explain
> it.
>
OK, I will update the commit log in the next patch revision.
BTW, I just tried experimenting with your suggestion of "removing the
blk_mq_free_request call from the nvme_keep_alive_finish function and
returning RQ_END_IO_FREE instead of RQ_END_IO_NONE" and I could still
hit the same issue.

The issue here is that after nvme_keep_alive_finish returns back up to
the block layer, the nvme keep-alive thread is still running the queue
dispatch operation (and hence accessing the queue resources) while the
queue might already have been destroyed on another cpu:
nvme_keep_alive_work()
  ->blk_execute_rq_nowait()
    ->blk_mq_run_hw_queue()
      ->blk_mq_sched_dispatch_requests()
        ->__blk_mq_sched_dispatch_requests()
          ->blk_mq_dispatch_rq_list()
            ->nvme_loop_queue_rq()
              ->nvme_fail_nonready_command()
                ->nvme_complete_rq()
                  ->nvme_end_req()
                    ->blk_mq_end_request()
                      ->__blk_mq_end_request() -- with your suggestion, we now decrement admin->q_usage_counter here
                        ->nvme_keep_alive_finish()
When the above call stack returns to __blk_mq_sched_dispatch_requests,
the admin queue might have already been destroyed on another cpu;
however, __blk_mq_sched_dispatch_requests could still access the admin
queue resources, causing the crash reported in the cover letter.
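With the synchronous approach in this patch, the flow roughly becomes
(sketch only):

nvme_keep_alive_work()
  status = blk_execute_rq(rq, false);       /* runs the dispatch and waits for completion */
  nvme_keep_alive_finish(rq, status, ctrl); /* the status is processed first ...          */
  blk_mq_free_request(rq);                  /* ... and only then is q_usage_counter dropped */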
>
>> -static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>> - blk_status_t status)
>> +static void nvme_keep_alive_finish(struct request *rq,
>> + blk_status_t status,
>> + struct nvme_ctrl *ctrl)
>
> And as a nipick, this should be:
>
> static void nvme_keep_alive_finish(struct request *rq, blk_status_t status,
> struct nvme_ctrl *ctrl)
>
>
Yes will do it the next patch.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
2024-10-08 6:21 [PATCH v3 0/3] nvme: system fault while shutting down fabric controller Nilay Shroff
2024-10-08 6:21 ` [PATCH v3 1/3] nvme-loop: flush off pending I/O while shutting down loop controller Nilay Shroff
2024-10-08 6:21 ` [PATCH v3 2/3] nvme: make keep-alive synchronous operation Nilay Shroff
@ 2024-10-08 6:21 ` Nilay Shroff
2024-10-08 7:06 ` Christoph Hellwig
2 siblings, 1 reply; 7+ messages in thread
From: Nilay Shroff @ 2024-10-08 6:21 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
We no longer need to acquire ctrl->lock before accessing the
NVMe controller state; instead we can now use the helper
nvme_ctrl_state. So replace the use of ctrl->lock in the
nvme_keep_alive_finish function with an nvme_ctrl_state call.
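The helper reads the controller state without taking the lock, so the
change boils down to the following (condensed from the diff below):

	/* Before: lock taken around a plain ctrl->state check */
	spin_lock_irqsave(&ctrl->lock, flags);
	if (ctrl->state == NVME_CTRL_LIVE ||
	    ctrl->state == NVME_CTRL_CONNECTING)
		startka = true;
	spin_unlock_irqrestore(&ctrl->lock, flags);
	if (startka)
		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);

	/* After: a single snapshot of the state via the helper */
	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);

	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);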
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 736adbf65ef5..4c83424fbd66 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1296,10 +1296,9 @@ static void nvme_keep_alive_finish(struct request *rq,
blk_status_t status,
struct nvme_ctrl *ctrl)
{
- unsigned long flags;
- bool startka = false;
unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
unsigned long delay = nvme_keep_alive_work_period(ctrl);
+ enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
/*
* Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1322,12 +1321,7 @@ static void nvme_keep_alive_finish(struct request *rq,
ctrl->ka_last_check_time = jiffies;
ctrl->comp_seen = false;
- spin_lock_irqsave(&ctrl->lock, flags);
- if (ctrl->state == NVME_CTRL_LIVE ||
- ctrl->state == NVME_CTRL_CONNECTING)
- startka = true;
- spin_unlock_irqrestore(&ctrl->lock, flags);
- if (startka)
+ if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
}
--
2.45.2
^ permalink raw reply related [flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
2024-10-08 6:21 ` [PATCH v3 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function Nilay Shroff
@ 2024-10-08 7:06 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2024-10-08 7:06 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Tue, Oct 08, 2024 at 11:51:52AM +0530, Nilay Shroff wrote:
> We no longer need to acquire ctrl->lock before accessing the
> NVMe controller state; instead we can now use the helper
> nvme_ctrl_state. So replace the use of ctrl->lock in the
> nvme_keep_alive_finish function with an nvme_ctrl_state call.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 7+ messages in thread