* [PATCH 0/3] nvme: fix system fault observed while shutting down controller
@ 2024-10-27 17:02 Nilay Shroff
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-27 17:02 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
Hi,
This patch series addresses the system fault observed while shutting
down a fabric controller. We already fixed it earlier[1], however it was
later realized that there is a better and more optimal way to address
it[2].
The first patch in the series reverts the changes implemented in [3] and
[4]. So essentially we're making the keep-alive operation asynchronous
again, as it was earlier.
The second patch in the series fixes the kernel crash observed while
shutting down a fabric controller.
The third patch in the series uses the nvme_ctrl_state function for
retrieving the controller state.
The system fault was observed because a keep-alive request sneaked in
while the fabric controller was being shut down. We encountered the below
intermittent kernel crash while running blktests nvme/037:
dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP: c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
autoremove_wake_function+0x0/0xbc (unreliable)
__blk_mq_sched_dispatch_requests+0x114/0x24c
blk_mq_sched_dispatch_requests+0x44/0x84
blk_mq_run_hw_queue+0x140/0x220
nvme_keep_alive_work+0xc8/0x19c [nvme_core]
process_one_work+0x200/0x4e0
worker_thread+0x340/0x504
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
We realized that the above crash is a regression caused by the changes
implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
beginning of the controller shutdown code path so that it can't sneak in
or interfere with the shutdown operation. However, that commit removed
the keep-alive stop operation from the beginning of the controller
shutdown code path, which created the possibility of keep-alive sneaking
in, interfering with the shutdown operation and causing the observed
kernel crash. So, to fix this crash, we're adding back the keep-alive
stop operation at the very beginning of the fabric controller shutdown
code path, so that the actual controller shutdown only begins after it's
ensured that no keep-alive operation is in-flight and none can be
scheduled in the future. This is fixed in the second patch of the series.
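For reference, nvme_stop_keep_alive() cancels the keep-alive delayed work
synchronously, so once it returns the keep-alive work is neither running
nor pending. The helper in drivers/nvme/host/core.c is essentially:

void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
{
	/* waits for a running ka_work instance and clears a pending one */
	if (ctrl->kato)
		cancel_delayed_work_sync(&ctrl->ka_work);
}

Calling it early in nvme_stop_ctrl() therefore closes the race window
before the admin queue teardown starts.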
The third patch in the series addresses the use of ctrl->lock around the
NVMe controller state access in the nvme_keep_alive_end_io function.
With the introduction of the nvme_ctrl_state helper, we no longer need
to acquire ctrl->lock before reading the NVMe controller state. So this
patch removes the use of ctrl->lock from the nvme_keep_alive_end_io
function and replaces it with a call to the nvme_ctrl_state helper.
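For reference, the nvme_ctrl_state() helper (in drivers/nvme/host/nvme.h)
is essentially a READ_ONCE() of the state field, which pairs with the
WRITE_ONCE() used when the controller state is updated:

static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
{
	/* pairs with the WRITE_ONCE() in nvme_change_ctrl_state() */
	return READ_ONCE(ctrl->state);
}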
[1]https://lore.kernel.org/all/ZxFSkNI2p65ucTB5@kbusch-mbp.dhcp.thefacebook.com/
[2]https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
[3]https://lore.kernel.org/all/20241016030339.54029-3-nilay@linux.ibm.com/
[4]https://lore.kernel.org/all/20241016030339.54029-4-nilay@linux.ibm.com/
Nilay Shroff (3):
Revert "nvme: make keep-alive synchronous operation"
nvme-fabrics: fix kernel crash while shutting down controller
nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
drivers/nvme/host/core.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation"
2024-10-27 17:02 [PATCH 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
@ 2024-10-27 17:02 ` Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 6:48 ` Ming Lei
2024-10-27 17:02 ` [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
2024-10-27 17:02 ` [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
2 siblings, 2 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-27 17:02 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
This reverts commit 599d9f3a10eec69ef28a90161763e4bd7c9c02bf.
It was realized that the fix implemented in commit d06923670b5a ("nvme:
make keep-alive synchronous operation") to avoid the race condition
between the keep-alive task and the fabric shutdown code path is not
optimal.
We also found that the above race condition is a regression caused by
the changes implemented in commit a54a93d0e359 ("nvme: move stopping
keep-alive into nvme_uninit_ctrl()"). So we decided to first revert
commit d06923670b5a ("nvme: make keep-alive synchronous operation") and
commit 599d9f3a10ee ("nvme: use helper nvme_ctrl_state in
nvme_keep_alive_finish function") and then fix the regression.
Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d..5016f69e9a15 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1292,12 +1292,14 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
}
-static void nvme_keep_alive_finish(struct request *rq,
- blk_status_t status, struct nvme_ctrl *ctrl)
+static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
+ blk_status_t status)
{
+ struct nvme_ctrl *ctrl = rq->end_io_data;
+ unsigned long flags;
+ bool startka = false;
unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
unsigned long delay = nvme_keep_alive_work_period(ctrl);
- enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
/*
* Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1311,17 +1313,25 @@ static void nvme_keep_alive_finish(struct request *rq,
delay = 0;
}
+ blk_mq_free_request(rq);
+
if (status) {
dev_err(ctrl->device,
"failed nvme_keep_alive_end_io error=%d\n",
status);
- return;
+ return RQ_END_IO_NONE;
}
ctrl->ka_last_check_time = jiffies;
ctrl->comp_seen = false;
- if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
+ spin_lock_irqsave(&ctrl->lock, flags);
+ if (ctrl->state == NVME_CTRL_LIVE ||
+ ctrl->state == NVME_CTRL_CONNECTING)
+ startka = true;
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ if (startka)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
+ return RQ_END_IO_NONE;
}
static void nvme_keep_alive_work(struct work_struct *work)
@@ -1330,7 +1340,6 @@ static void nvme_keep_alive_work(struct work_struct *work)
struct nvme_ctrl, ka_work);
bool comp_seen = ctrl->comp_seen;
struct request *rq;
- blk_status_t status;
ctrl->ka_last_check_time = jiffies;
@@ -1353,9 +1362,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
nvme_init_request(rq, &ctrl->ka_cmd);
rq->timeout = ctrl->kato * HZ;
- status = blk_execute_rq(rq, false);
- nvme_keep_alive_finish(rq, status, ctrl);
- blk_mq_free_request(rq);
+ rq->end_io = nvme_keep_alive_end_io;
+ rq->end_io_data = ctrl;
+ blk_execute_rq_nowait(rq, false);
}
static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
--
2.45.2
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-27 17:02 [PATCH 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
@ 2024-10-27 17:02 ` Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 7:23 ` Ming Lei
2024-10-27 17:02 ` [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
2 siblings, 2 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-27 17:02 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down the controller) and the
hw/hctx queue dispatcher invoked from the nvme keep-alive async request
queuing operation. This race could lead to the kernel crash shown below:
Call Trace:
autoremove_wake_function+0x0/0xbc (unreliable)
__blk_mq_sched_dispatch_requests+0x114/0x24c
blk_mq_sched_dispatch_requests+0x44/0x84
blk_mq_run_hw_queue+0x140/0x220
nvme_keep_alive_work+0xc8/0x19c [nvme_core]
process_one_work+0x200/0x4e0
worker_thread+0x340/0x504
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
While shutting down the fabric controller, if an nvme keep-alive request
sneaks in, it would be flushed out. The nvme_keep_alive_end_io function
is then invoked to handle the end of the keep-alive operation; it drops
the admin queue's q_usage_counter reference and, assuming this was the
last/only request in the admin queue, q_usage_counter becomes zero. If
that happens, the blk-mq destroy queue operation (blk_mq_destroy_queue()),
which could be running simultaneously on another cpu (as this is the
controller shutdown code path), would make forward progress and delete
the admin queue. From this point onward we are not supposed to access
the admin queue resources anymore. However, the issue here is that the
nvme keep-alive work running the hw/hctx queue dispatch operation hasn't
yet finished, so it could still access the admin queue resources after
the admin queue has already been deleted, and that causes the above
crash.
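For context, blk_mq_destroy_queue() only waits for q_usage_counter to
reach zero before tearing the queue down; it does not wait for a hw
queue dispatch that is still running but no longer holds a request
reference. A trimmed sketch of the block layer side (details vary by
kernel version):

void blk_mq_destroy_queue(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
	blk_queue_start_drain(q);
	blk_mq_freeze_queue_wait(q);	/* returns once q_usage_counter is zero */
	blk_sync_queue(q);
	blk_mq_exit_queue(q);		/* queue resources are torn down here */
}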
The above kernel crash is a regression caused by the changes implemented
in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
beginning of the controller shutdown code path so that it can't sneak in
during the shutdown operation. However, that commit removed the
keep-alive stop operation from the beginning of the controller shutdown
code path, which created the possibility of keep-alive sneaking in,
interfering with the shutdown operation and causing the observed kernel
crash. So, to fix this crash, we're adding back the keep-alive stop
operation at the very beginning of the fabric controller shutdown code
path, so that the actual controller shutdown only begins after it's
ensured that no keep-alive operation is in-flight and none can be
scheduled in the future.
Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5016f69e9a15..865c00ea19e3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
nvme_mpath_stop(ctrl);
nvme_auth_stop(ctrl);
+ /*
+ * the transport driver may be terminating the admin tagset a little
+ * later on, so we cannot have the keep-alive work running
+ */
+ nvme_stop_keep_alive(ctrl);
nvme_stop_failfast_work(ctrl);
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fw_act_work);
--
2.45.2
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-27 17:02 [PATCH 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
2024-10-27 17:02 ` [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
@ 2024-10-27 17:02 ` Nilay Shroff
2024-10-27 22:07 ` Sagi Grimberg
2024-10-29 15:01 ` Keith Busch
2 siblings, 2 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-27 17:02 UTC (permalink / raw)
To: linux-nvme
Cc: kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce,
Nilay Shroff
We no longer need to acquire ctrl->lock before accessing the NVMe
controller state; instead we can now use the helper nvme_ctrl_state.
So replace the use of ctrl->lock in the nvme_keep_alive_end_io function
with a call to nvme_ctrl_state.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 865c00ea19e3..6e814daf1002 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
blk_status_t status)
{
struct nvme_ctrl *ctrl = rq->end_io_data;
- unsigned long flags;
- bool startka = false;
unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
unsigned long delay = nvme_keep_alive_work_period(ctrl);
+ enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
/*
* Subtract off the keepalive RTT so nvme_keep_alive_work runs
@@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
ctrl->ka_last_check_time = jiffies;
ctrl->comp_seen = false;
- spin_lock_irqsave(&ctrl->lock, flags);
- if (ctrl->state == NVME_CTRL_LIVE ||
- ctrl->state == NVME_CTRL_CONNECTING)
- startka = true;
- spin_unlock_irqrestore(&ctrl->lock, flags);
- if (startka)
+ if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
+
return RQ_END_IO_NONE;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation"
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
@ 2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 6:48 ` Ming Lei
1 sibling, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2024-10-27 22:05 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-27 17:02 ` [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
@ 2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 7:23 ` Ming Lei
1 sibling, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2024-10-27 22:05 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-27 17:02 ` [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
@ 2024-10-27 22:07 ` Sagi Grimberg
2024-10-28 4:43 ` Nilay Shroff
2024-10-29 15:01 ` Keith Busch
1 sibling, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2024-10-27 22:07 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce
On 27/10/2024 19:02, Nilay Shroff wrote:
> We no more need acquiring ctrl->lock before accessing the
> NVMe controller state and instead we can now use the helper
> nvme_ctrl_state. So replace the use of ctrl->lock from
> nvme_keep_alive_end_io function with nvme_ctrl_state call.
The patch description is not clear. Why do we "no longer need"
to take the ctrl->lock? What exactly changed for this?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-27 22:07 ` Sagi Grimberg
@ 2024-10-28 4:43 ` Nilay Shroff
2024-10-29 14:58 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Nilay Shroff @ 2024-10-28 4:43 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: kbusch, hch, axboe, chaitanyak, dlemoal, gjoyce
On 10/28/24 03:37, Sagi Grimberg wrote:
>
>
>
> On 27/10/2024 19:02, Nilay Shroff wrote:
>> We no more need acquiring ctrl->lock before accessing the
>> NVMe controller state and instead we can now use the helper
>> nvme_ctrl_state. So replace the use of ctrl->lock from
>> nvme_keep_alive_end_io function with nvme_ctrl_state call.
>
> The patch description is not clear. Why do we "no longer need"
> to take the ctrl->lock? What exactly changed for this?
>
The update of ctrl->state is now done using WRITE_ONCE(), so it makes
sense to pair the read of ctrl->state with READ_ONCE(). The helper
nvme_ctrl_state(), in fact, reads ctrl->state using READ_ONCE(). So, if
we use nvme_ctrl_state() for retrieving the ctrl->state value, we don't
need locking.
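For illustration, the writer side takes ctrl->lock only to serialize the
state-machine transition itself; the store is a WRITE_ONCE() that the
lockless READ_ONCE() in nvme_ctrl_state() pairs with. A trimmed sketch
(the transition validation is elided):

bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
		enum nvme_ctrl_state new_state)
{
	unsigned long flags;
	bool changed = false;

	spin_lock_irqsave(&ctrl->lock, flags);
	/* ... validate the old_state -> new_state transition, set changed ... */
	if (changed) {
		WRITE_ONCE(ctrl->state, new_state);
		wake_up_all(&ctrl->state_wq);
	}
	spin_unlock_irqrestore(&ctrl->lock, flags);
	return changed;
}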
I will reword the commit message of this patch to describe the above
intent.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation"
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
@ 2024-10-29 6:48 ` Ming Lei
2024-10-29 12:46 ` Nilay Shroff
1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-10-29 6:48 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Mon, Oct 28, 2024 at 1:03 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
> This reverts commit 599d9f3a10eec69ef28a90161763e4bd7c9c02bf.
>
> It was realized that the fix implemented to avoid the race
> condition between keep alive task and the fabric shutdown code
> path in the commit d06923670b5ia ("nvme: make keep-alive
> synchronous operation") is not optimal.
I saw you have discussed it for a while, but it is still better to
describe the reason in the commit log.
>
> We also found that the above race condition is regression caused
> due to the changes implemented in commit a54a93d0e359 ("nvme: move
> stopping keep-alive into nvme_uninit_ctrl()"). So we decided to
Can you explain a bit why commit a54a93d0e359 is a regression?
And what is the race condition?
Without providing the context info, it is hard to review the change.
Thanks,
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-27 17:02 ` [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
@ 2024-10-29 7:23 ` Ming Lei
2024-10-29 12:40 ` Nilay Shroff
1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-10-29 7:23 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
> The nvme keep-alive operation, which executes at a periodic interval,
> could potentially sneak in while shutting down a fabric controller.
> This may lead to a race between the fabric controller admin queue
> destroy code path (invoked while shutting down controller) and hw/hctx
> queue dispatcher called from the nvme keep-alive async request queuing
> operation. This race could lead to the kernel crash shown below:
>
> Call Trace:
> autoremove_wake_function+0x0/0xbc (unreliable)
> __blk_mq_sched_dispatch_requests+0x114/0x24c
> blk_mq_sched_dispatch_requests+0x44/0x84
> blk_mq_run_hw_queue+0x140/0x220
> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
> process_one_work+0x200/0x4e0
> worker_thread+0x340/0x504
> kthread+0x138/0x140
> start_kernel_thread+0x14/0x18
>
> While shutting down fabric controller, if nvme keep-alive request sneaks
> in then it would be flushed off. The nvme_keep_alive_end_io function is
> then invoked to handle the end of the keep-alive operation which
> decrements the admin->q_usage_counter and assuming this is the last/only
> request in the admin queue then the admin->q_usage_counter becomes zero.
> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
> queue()) which could be potentially running simultaneously on another
> cpu (as this is the controller shutdown code path) would forward
> progress and deletes the admin queue. So, now from this point onward
> we are not supposed to access the admin queue resources. However the
> issue here's that the nvme keep-alive thread running hw/hctx queue
> dispatch operation hasn't yet finished its work and so it could still
> potentially access the admin queue resource while the admin queue had
> been already deleted and that causes the above crash.
>
> The above kernel crash is regression caused due to changes implemented
> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
> beggining of the controller shutdown code path so that it wouldn't
> sneak in during the shutdown operation. However we removed the keep
> alive stop operation from the beginning of the controller shutdown
> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> nvme_uninit_ctrl()") and that now created the possibility of keep-alive
> sneaking in and interfering with the shutdown operation and causing
> observed kernel crash. So to fix this crash, now we're adding back the
> keep-alive stop operation at very beginning of the fabric controller
> shutdown code path so that the actual controller shutdown opeation only
> begins after it's ensured that keep-alive operation is not in-flight and
> also it can't be scheduled in future.
>
> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 5016f69e9a15..865c00ea19e3 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> {
> nvme_mpath_stop(ctrl);
> nvme_auth_stop(ctrl);
> + /*
> + * the transport driver may be terminating the admin tagset a little
> + * later on, so we cannot have the keep-alive work running
> + */
> + nvme_stop_keep_alive(ctrl);
> nvme_stop_failfast_work(ctrl);
> flush_work(&ctrl->async_event_work);
> cancel_work_sync(&ctrl->fw_act_work);
The change looks fine.
IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
can be saved?
thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-29 7:23 ` Ming Lei
@ 2024-10-29 12:40 ` Nilay Shroff
2024-10-30 2:20 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Nilay Shroff @ 2024-10-29 12:40 UTC (permalink / raw)
To: Ming Lei
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/29/24 12:53, Ming Lei wrote:
> On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
>> The nvme keep-alive operation, which executes at a periodic interval,
>> could potentially sneak in while shutting down a fabric controller.
>> This may lead to a race between the fabric controller admin queue
>> destroy code path (invoked while shutting down controller) and hw/hctx
>> queue dispatcher called from the nvme keep-alive async request queuing
>> operation. This race could lead to the kernel crash shown below:
>>
>> Call Trace:
>> autoremove_wake_function+0x0/0xbc (unreliable)
>> __blk_mq_sched_dispatch_requests+0x114/0x24c
>> blk_mq_sched_dispatch_requests+0x44/0x84
>> blk_mq_run_hw_queue+0x140/0x220
>> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>> process_one_work+0x200/0x4e0
>> worker_thread+0x340/0x504
>> kthread+0x138/0x140
>> start_kernel_thread+0x14/0x18
>>
>> While shutting down fabric controller, if nvme keep-alive request sneaks
>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>> then invoked to handle the end of the keep-alive operation which
>> decrements the admin->q_usage_counter and assuming this is the last/only
>> request in the admin queue then the admin->q_usage_counter becomes zero.
>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>> queue()) which could be potentially running simultaneously on another
>> cpu (as this is the controller shutdown code path) would forward
>> progress and deletes the admin queue. So, now from this point onward
>> we are not supposed to access the admin queue resources. However the
>> issue here's that the nvme keep-alive thread running hw/hctx queue
>> dispatch operation hasn't yet finished its work and so it could still
>> potentially access the admin queue resource while the admin queue had
>> been already deleted and that causes the above crash.
>>
>> The above kernel crash is regression caused due to changes implemented
>> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
>> beggining of the controller shutdown code path so that it wouldn't
>> sneak in during the shutdown operation. However we removed the keep
>> alive stop operation from the beginning of the controller shutdown
>> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>> nvme_uninit_ctrl()") and that now created the possibility of keep-alive
>> sneaking in and interfering with the shutdown operation and causing
>> observed kernel crash. So to fix this crash, now we're adding back the
>> keep-alive stop operation at very beginning of the fabric controller
>> shutdown code path so that the actual controller shutdown opeation only
>> begins after it's ensured that keep-alive operation is not in-flight and
>> also it can't be scheduled in future.
>>
>> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
>> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 5016f69e9a15..865c00ea19e3 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>> {
>> nvme_mpath_stop(ctrl);
>> nvme_auth_stop(ctrl);
>> + /*
>> + * the transport driver may be terminating the admin tagset a little
>> + * later on, so we cannot have the keep-alive work running
>> + */
>> + nvme_stop_keep_alive(ctrl);
>> nvme_stop_failfast_work(ctrl);
>> flush_work(&ctrl->async_event_work);
>> cancel_work_sync(&ctrl->fw_act_work);
>
> The change looks fine.
>
> IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
> entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
> can be saved?
>
Yes, that should work. However, IMO stopping keep-alive at the very
beginning of the shutdown operation makes sense because delaying it
would not be useful anyway once we start the controller shutdown.
Keep-alive may sneak in unnecessarily while we shut down the controller,
and later we would have to flush it out.
And yes, as you mentioned, in that case we would save one call site, but
looking at the code we already have a few other call sites where we call
nvme_stop_keep_alive().
>
> thanks,
> Ming
>
>
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation"
2024-10-29 6:48 ` Ming Lei
@ 2024-10-29 12:46 ` Nilay Shroff
0 siblings, 0 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-29 12:46 UTC (permalink / raw)
To: Ming Lei
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/29/24 12:18, Ming Lei wrote:
> On Mon, Oct 28, 2024 at 1:03 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
>>
>> This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
>> This reverts commit 599d9f3a10eec69ef28a90161763e4bd7c9c02bf.
>>
>> It was realized that the fix implemented to avoid the race
>> condition between keep alive task and the fabric shutdown code
>> path in the commit d06923670b5ia ("nvme: make keep-alive
>> synchronous operation") is not optimal.
>
> I saw you have discussed it a while, but it is still better to describe
> the reason in the commit log.
>
Sure, I will enhance the commit message to clarify it further.
>>
>> We also found that the above race condition is regression caused
>> due to the changes implemented in commit a54a93d0e359 ("nvme: move
>> stopping keep-alive into nvme_uninit_ctrl()"). So we decided to
>
> Can you explain a bit why commit a54a93d0e359 is a regression?
> And what is the race condition?
>
> Without providing the context info, it is hard to review the change.
>
> Thanks,
>
>
OK, the root cause of the race condition and how it could be triggered
has already been discussed here[1].
In any case, if you still want me to clarify it further, please let me
know.
[1]https://lore.kernel.org/all/b03863b2-8816-48cd-aa18-436f1c49ec8c@linux.ibm.com
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-28 4:43 ` Nilay Shroff
@ 2024-10-29 14:58 ` Keith Busch
0 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2024-10-29 14:58 UTC (permalink / raw)
To: Nilay Shroff
Cc: Sagi Grimberg, linux-nvme, hch, axboe, chaitanyak, dlemoal,
gjoyce
On Mon, Oct 28, 2024 at 10:13:25AM +0530, Nilay Shroff wrote:
> On 10/28/24 03:37, Sagi Grimberg wrote:
> > On 27/10/2024 19:02, Nilay Shroff wrote:
> >> We no more need acquiring ctrl->lock before accessing the
> >> NVMe controller state and instead we can now use the helper
> >> nvme_ctrl_state. So replace the use of ctrl->lock from
> >> nvme_keep_alive_end_io function with nvme_ctrl_state call.
> >
> > The patch description is not clear. Why do we "no longer need"
> > to take the ctrl->lock? What exactly changed for this?
> >
> The update of ctrl->state is now done using WRITE_ONCE(). So it
> makes sense to pair the read of ctrl->state using READ_ONCE(). And
> the helper nvme_ctrl_state(), in fact, reads the ctrl->state
> using READ_ONCE(). So, if we use the nvme_ctrl_state() for
> retrieving ctrl->state value then we don't need locking.
>
> I think I would better format the commit message of this patch
> to describe the above intent.
Makes sense to me. If the code was racing with another thread attempting
to change the state, then we either saw the old state or the new state.
Same is true with using the state accessor without the ctrl->lock.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-27 17:02 ` [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
2024-10-27 22:07 ` Sagi Grimberg
@ 2024-10-29 15:01 ` Keith Busch
2024-10-30 6:07 ` Nilay Shroff
1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-10-29 15:01 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Sun, Oct 27, 2024 at 10:32:06PM +0530, Nilay Shroff wrote:
> We no more need acquiring ctrl->lock before accessing the
> NVMe controller state and instead we can now use the helper
> nvme_ctrl_state. So replace the use of ctrl->lock from
> nvme_keep_alive_end_io function with nvme_ctrl_state call.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 11 +++--------
> 1 file changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 865c00ea19e3..6e814daf1002 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> blk_status_t status)
> {
> struct nvme_ctrl *ctrl = rq->end_io_data;
> - unsigned long flags;
> - bool startka = false;
> unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
> unsigned long delay = nvme_keep_alive_work_period(ctrl);
> + enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>
> /*
> * Subtract off the keepalive RTT so nvme_keep_alive_work runs
> @@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>
> ctrl->ka_last_check_time = jiffies;
> ctrl->comp_seen = false;
> - spin_lock_irqsave(&ctrl->lock, flags);
> - if (ctrl->state == NVME_CTRL_LIVE ||
> - ctrl->state == NVME_CTRL_CONNECTING)
> - startka = true;
> - spin_unlock_irqrestore(&ctrl->lock, flags);
> - if (startka)
> + if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
> queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
> +
> return RQ_END_IO_NONE;
> }
This restores what you had in the reverted patch, which was a
significant portion of the diff stat. If we're keeping this chunk, maybe
a full revert isn't necessary, and we can squash the series to just one
fixing commit.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-29 12:40 ` Nilay Shroff
@ 2024-10-30 2:20 ` Ming Lei
2024-10-30 10:38 ` Nilay Shroff
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-10-30 2:20 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Tue, Oct 29, 2024 at 06:10:00PM +0530, Nilay Shroff wrote:
>
>
> On 10/29/24 12:53, Ming Lei wrote:
> > On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
> >> The nvme keep-alive operation, which executes at a periodic interval,
> >> could potentially sneak in while shutting down a fabric controller.
> >> This may lead to a race between the fabric controller admin queue
> >> destroy code path (invoked while shutting down controller) and hw/hctx
> >> queue dispatcher called from the nvme keep-alive async request queuing
> >> operation. This race could lead to the kernel crash shown below:
> >>
> >> Call Trace:
> >> autoremove_wake_function+0x0/0xbc (unreliable)
> >> __blk_mq_sched_dispatch_requests+0x114/0x24c
> >> blk_mq_sched_dispatch_requests+0x44/0x84
> >> blk_mq_run_hw_queue+0x140/0x220
> >> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
> >> process_one_work+0x200/0x4e0
> >> worker_thread+0x340/0x504
> >> kthread+0x138/0x140
> >> start_kernel_thread+0x14/0x18
> >>
> >> While shutting down fabric controller, if nvme keep-alive request sneaks
> >> in then it would be flushed off. The nvme_keep_alive_end_io function is
> >> then invoked to handle the end of the keep-alive operation which
> >> decrements the admin->q_usage_counter and assuming this is the last/only
> >> request in the admin queue then the admin->q_usage_counter becomes zero.
> >> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
> >> queue()) which could be potentially running simultaneously on another
> >> cpu (as this is the controller shutdown code path) would forward
> >> progress and deletes the admin queue. So, now from this point onward
> >> we are not supposed to access the admin queue resources. However the
> >> issue here's that the nvme keep-alive thread running hw/hctx queue
> >> dispatch operation hasn't yet finished its work and so it could still
> >> potentially access the admin queue resource while the admin queue had
> >> been already deleted and that causes the above crash.
> >>
> >> The above kernel crash is regression caused due to changes implemented
> >> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> >> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
> >> beggining of the controller shutdown code path so that it wouldn't
> >> sneak in during the shutdown operation. However we removed the keep
> >> alive stop operation from the beginning of the controller shutdown
> >> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> >> nvme_uninit_ctrl()") and that now created the possibility of keep-alive
> >> sneaking in and interfering with the shutdown operation and causing
> >> observed kernel crash. So to fix this crash, now we're adding back the
> >> keep-alive stop operation at very beginning of the fabric controller
> >> shutdown code path so that the actual controller shutdown opeation only
> >> begins after it's ensured that keep-alive operation is not in-flight and
> >> also it can't be scheduled in future.
> >>
> >> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> >> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
> >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >> ---
> >> drivers/nvme/host/core.c | 5 +++++
> >> 1 file changed, 5 insertions(+)
> >>
> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> index 5016f69e9a15..865c00ea19e3 100644
> >> --- a/drivers/nvme/host/core.c
> >> +++ b/drivers/nvme/host/core.c
> >> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> >> {
> >> nvme_mpath_stop(ctrl);
> >> nvme_auth_stop(ctrl);
> >> + /*
> >> + * the transport driver may be terminating the admin tagset a little
> >> + * later on, so we cannot have the keep-alive work running
> >> + */
> >> + nvme_stop_keep_alive(ctrl);
> >> nvme_stop_failfast_work(ctrl);
> >> flush_work(&ctrl->async_event_work);
> >> cancel_work_sync(&ctrl->fw_act_work);
> >
> > The change looks fine.
> >
> > IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
> > entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
> > can be saved?
> >
> Yes that should work however IMO, stopping keep-alive at very beginning of
> shutdown operation would make sense because delaying the stopping of keep-alive
> would not be useful anyways once we start the controller shutdown. It may
> sneak in unnecessarily while we shutdown controller and later we will have to
> flush it off.
>
> And yes, as you mentioned, in this case we would save one call site but
> looking at the code we have few other call sites already present where we
> call nvme_stop_keep_alive().
If it works, I'd suggest moving nvme_stop_keep_alive() from
nvme_uninit_ctrl() into nvme_remove_admin_tag_set() because:
1) it isn't correct to do it in nvme_uninit_ctrl()
2) stopping keep-alive in nvme_remove_admin_tag_set() covers everything
3) most of the nvme_stop_keep_alive() calls in the failure paths can
then be removed
A sketch of what I mean is below.
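Something along these lines (untested sketch; the existing body of
nvme_remove_admin_tag_set() is shown only for context):

void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
{
	/*
	 * We are about to destroy the admin queue and free the tagset,
	 * so the keep-alive work must not run past this point.
	 */
	nvme_stop_keep_alive(ctrl);
	blk_mq_destroy_queue(ctrl->admin_q);
	if (ctrl->ops->flags & NVME_F_FABRICS)
		blk_mq_destroy_queue(ctrl->fabrics_q);
	blk_mq_free_tag_set(ctrl->admin_tagset);
}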
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-29 15:01 ` Keith Busch
@ 2024-10-30 6:07 ` Nilay Shroff
2024-11-01 19:13 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Nilay Shroff @ 2024-10-30 6:07 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/29/24 20:31, Keith Busch wrote:
> On Sun, Oct 27, 2024 at 10:32:06PM +0530, Nilay Shroff wrote:
>> We no more need acquiring ctrl->lock before accessing the
>> NVMe controller state and instead we can now use the helper
>> nvme_ctrl_state. So replace the use of ctrl->lock from
>> nvme_keep_alive_end_io function with nvme_ctrl_state call.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 11 +++--------
>> 1 file changed, 3 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 865c00ea19e3..6e814daf1002 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>> blk_status_t status)
>> {
>> struct nvme_ctrl *ctrl = rq->end_io_data;
>> - unsigned long flags;
>> - bool startka = false;
>> unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
>> unsigned long delay = nvme_keep_alive_work_period(ctrl);
>> + enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>>
>> /*
>> * Subtract off the keepalive RTT so nvme_keep_alive_work runs
>> @@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>>
>> ctrl->ka_last_check_time = jiffies;
>> ctrl->comp_seen = false;
>> - spin_lock_irqsave(&ctrl->lock, flags);
>> - if (ctrl->state == NVME_CTRL_LIVE ||
>> - ctrl->state == NVME_CTRL_CONNECTING)
>> - startka = true;
>> - spin_unlock_irqrestore(&ctrl->lock, flags);
>> - if (startka)
>> + if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
>> queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
>> +
>> return RQ_END_IO_NONE;
>> }
>
> This restores what you had in the reverted patch, which was a
> significant portion of the diff stat. If we're keeping this chunk, maybe
> a full revert isn't necessary, and we can squash the series to just one
> fixing commit.
>
This doesn't exactly restore what was reverted in the first patch. If
you look closely at the reverted patch, you would find that we reverted
the nvme_keep_alive_finish function, which is not exactly the same as
the nvme_keep_alive_end_io function. We had renamed
nvme_keep_alive_end_io() to nvme_keep_alive_finish() for the upstream
patch.
So the revert patch restores the nvme_keep_alive_end_io function and
also restores the nvme_keep_alive_work changes. The current patch in
this patchset then builds on the nvme_keep_alive_end_io function.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-30 2:20 ` Ming Lei
@ 2024-10-30 10:38 ` Nilay Shroff
2024-10-30 12:51 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Nilay Shroff @ 2024-10-30 10:38 UTC (permalink / raw)
To: Ming Lei
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/30/24 07:50, Ming Lei wrote:
> On Tue, Oct 29, 2024 at 06:10:00PM +0530, Nilay Shroff wrote:
>>
>>
>> On 10/29/24 12:53, Ming Lei wrote:
>>> On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>> could potentially sneak in while shutting down a fabric controller.
>>>> This may lead to a race between the fabric controller admin queue
>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>> operation. This race could lead to the kernel crash shown below:
>>>>
>>>> Call Trace:
>>>> autoremove_wake_function+0x0/0xbc (unreliable)
>>>> __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>> blk_mq_sched_dispatch_requests+0x44/0x84
>>>> blk_mq_run_hw_queue+0x140/0x220
>>>> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>> process_one_work+0x200/0x4e0
>>>> worker_thread+0x340/0x504
>>>> kthread+0x138/0x140
>>>> start_kernel_thread+0x14/0x18
>>>>
>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>> then invoked to handle the end of the keep-alive operation which
>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>> queue()) which could be potentially running simultaneously on another
>>>> cpu (as this is the controller shutdown code path) would forward
>>>> progress and deletes the admin queue. So, now from this point onward
>>>> we are not supposed to access the admin queue resources. However the
>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>> dispatch operation hasn't yet finished its work and so it could still
>>>> potentially access the admin queue resource while the admin queue had
>>>> been already deleted and that causes the above crash.
>>>>
>>>> The above kernel crash is regression caused due to changes implemented
>>>> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>>>> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
>>>> beggining of the controller shutdown code path so that it wouldn't
>>>> sneak in during the shutdown operation. However we removed the keep
>>>> alive stop operation from the beginning of the controller shutdown
>>>> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>>>> nvme_uninit_ctrl()") and that now created the possibility of keep-alive
>>>> sneaking in and interfering with the shutdown operation and causing
>>>> observed kernel crash. So to fix this crash, now we're adding back the
>>>> keep-alive stop operation at very beginning of the fabric controller
>>>> shutdown code path so that the actual controller shutdown opeation only
>>>> begins after it's ensured that keep-alive operation is not in-flight and
>>>> also it can't be scheduled in future.
>>>>
>>>> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
>>>> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 5 +++++
>>>> 1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 5016f69e9a15..865c00ea19e3 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>>>> {
>>>> nvme_mpath_stop(ctrl);
>>>> nvme_auth_stop(ctrl);
>>>> + /*
>>>> + * the transport driver may be terminating the admin tagset a little
>>>> + * later on, so we cannot have the keep-alive work running
>>>> + */
>>>> + nvme_stop_keep_alive(ctrl);
>>>> nvme_stop_failfast_work(ctrl);
>>>> flush_work(&ctrl->async_event_work);
>>>> cancel_work_sync(&ctrl->fw_act_work);
>>>
>>> The change looks fine.
>>>
>>> IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
>>> entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
>>> can be saved?
>>>
>> Yes that should work however IMO, stopping keep-alive at very beginning of
>> shutdown operation would make sense because delaying the stopping of keep-alive
>> would not be useful anyways once we start the controller shutdown. It may
>> sneak in unnecessarily while we shutdown controller and later we will have to
>> flush it off.
>>
>> And yes, as you mentioned, in this case we would save one call site but
>> looking at the code we have few other call sites already present where we
>> call nvme_stop_keep_alive().
>
> If it works, I'd suggest to move nvme_stop_keep_alive() from
> nvme_uninit_ctrl() into nvme_remove_admin_tag_set() because:
>
> 1) it isn't correct to do it in nvme_uninit_ctrl()
>
> 2) stop keep alive in nvme_remove_admin_tag_set() covers everything
>
> 3) most of nvme_stop_keep_alive() can be removed in failure path
>
Moving nvme_stop_keep_alive() from nvme_uninit_ctrl() to
nvme_remove_admin_tag_set() works; I tested it with blktests nvme/037.
However, I am afraid that we may not be able to remove most of the
nvme_stop_keep_alive() calls from the failure/recovery code paths,
because in most of those paths we don't invoke
nvme_remove_admin_tag_set().
For instance, nvme_tcp_error_recovery_work() does not invoke
nvme_remove_admin_tag_set() and we explicitly call
nvme_stop_keep_alive(). The same is true for the nvme rdma error
recovery case.
Another example: when nvme_tcp_setup_ctrl() fails, we don't call
nvme_remove_admin_tag_set() but we explicitly call
nvme_stop_keep_alive().
A third example: while resetting a tcp/rdma ctrl, we don't call
nvme_remove_admin_tag_set() but we explicitly call
nvme_stop_keep_alive().
In your first point above, you mentioned that calling
nvme_stop_keep_alive() from nvme_uninit_ctrl() isn't correct; would you
please explain why that is, considering the fact that we call
nvme_start_keep_alive() from nvme_init_ctrl_finish()?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-30 10:38 ` Nilay Shroff
@ 2024-10-30 12:51 ` Ming Lei
2024-10-31 10:10 ` Nilay Shroff
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-10-30 12:51 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Wed, Oct 30, 2024 at 04:08:16PM +0530, Nilay Shroff wrote:
>
>
> On 10/30/24 07:50, Ming Lei wrote:
> > On Tue, Oct 29, 2024 at 06:10:00PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 10/29/24 12:53, Ming Lei wrote:
> >>> On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
> >>>> The nvme keep-alive operation, which executes at a periodic interval,
> >>>> could potentially sneak in while shutting down a fabric controller.
> >>>> This may lead to a race between the fabric controller admin queue
> >>>> destroy code path (invoked while shutting down controller) and hw/hctx
> >>>> queue dispatcher called from the nvme keep-alive async request queuing
> >>>> operation. This race could lead to the kernel crash shown below:
> >>>>
> >>>> Call Trace:
> >>>> autoremove_wake_function+0x0/0xbc (unreliable)
> >>>> __blk_mq_sched_dispatch_requests+0x114/0x24c
> >>>> blk_mq_sched_dispatch_requests+0x44/0x84
> >>>> blk_mq_run_hw_queue+0x140/0x220
> >>>> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
> >>>> process_one_work+0x200/0x4e0
> >>>> worker_thread+0x340/0x504
> >>>> kthread+0x138/0x140
> >>>> start_kernel_thread+0x14/0x18
> >>>>
> >>>> While shutting down fabric controller, if nvme keep-alive request sneaks
> >>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
> >>>> then invoked to handle the end of the keep-alive operation which
> >>>> decrements the admin->q_usage_counter and assuming this is the last/only
> >>>> request in the admin queue then the admin->q_usage_counter becomes zero.
> >>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
> >>>> queue()) which could be potentially running simultaneously on another
> >>>> cpu (as this is the controller shutdown code path) would forward
> >>>> progress and deletes the admin queue. So, now from this point onward
> >>>> we are not supposed to access the admin queue resources. However the
> >>>> issue here's that the nvme keep-alive thread running hw/hctx queue
> >>>> dispatch operation hasn't yet finished its work and so it could still
> >>>> potentially access the admin queue resource while the admin queue had
> >>>> been already deleted and that causes the above crash.
> >>>>
> >>>> The above kernel crash is regression caused due to changes implemented
> >>>> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> >>>> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
> >>>> beggining of the controller shutdown code path so that it wouldn't
> >>>> sneak in during the shutdown operation. However we removed the keep
> >>>> alive stop operation from the beginning of the controller shutdown
> >>>> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
> >>>> nvme_uninit_ctrl()") and that now created the possibility of keep-alive
> >>>> sneaking in and interfering with the shutdown operation and causing
> >>>> observed kernel crash. So to fix this crash, now we're adding back the
> >>>> keep-alive stop operation at very beginning of the fabric controller
> >>>> shutdown code path so that the actual controller shutdown opeation only
> >>>> begins after it's ensured that keep-alive operation is not in-flight and
> >>>> also it can't be scheduled in future.
> >>>>
> >>>> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> >>>> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
> >>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >>>> ---
> >>>> drivers/nvme/host/core.c | 5 +++++
> >>>> 1 file changed, 5 insertions(+)
> >>>>
> >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>>> index 5016f69e9a15..865c00ea19e3 100644
> >>>> --- a/drivers/nvme/host/core.c
> >>>> +++ b/drivers/nvme/host/core.c
> >>>> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> >>>> {
> >>>> nvme_mpath_stop(ctrl);
> >>>> nvme_auth_stop(ctrl);
> >>>> + /*
> >>>> + * the transport driver may be terminating the admin tagset a little
> >>>> + * later on, so we cannot have the keep-alive work running
> >>>> + */
> >>>> + nvme_stop_keep_alive(ctrl);
> >>>> nvme_stop_failfast_work(ctrl);
> >>>> flush_work(&ctrl->async_event_work);
> >>>> cancel_work_sync(&ctrl->fw_act_work);
> >>>
> >>> The change looks fine.
> >>>
> >>> IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
> >>> entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
> >>> can be saved?
> >>>
> >> Yes that should work however IMO, stopping keep-alive at very beginning of
> >> shutdown operation would make sense because delaying the stopping of keep-alive
> >> would not be useful anyways once we start the controller shutdown. It may
> >> sneak in unnecessarily while we shutdown controller and later we will have to
> >> flush it off.
> >>
> >> And yes, as you mentioned, in this case we would save one call site but
> >> looking at the code we have few other call sites already present where we
> >> call nvme_stop_keep_alive().
> >
> > If it works, I'd suggest to move nvme_stop_keep_alive() from
> > nvme_uninit_ctrl() into nvme_remove_admin_tag_set() because:
> >
> > 1) it isn't correct to do it in nvme_uninit_ctrl()
> >
> > 2) stop keep alive in nvme_remove_admin_tag_set() covers everything
> >
> > 3) most of nvme_stop_keep_alive() can be removed in failure path
> >
>
> Moving nvme_stop_keep_alive() from nvme_uninit_ctrl() to nvme_remove_admin_
> tag_set() works, I tested it with blktest nvme/037. However, I am afraid
> that we may not be able to remove most of nvme_stop_keep_alive() from the
> failure/recovery code path. The reason being, in most of the failure/recovery
> code path we don't invoke nvme_remove_admin_tag_set().
>
> For instance, in case of nvme_tcp_error_recovery_work(), the nvme_remove_
> admin_tag_set() is not invoked and we explicitly call nvme_stop_keep_alive().
> The same is true for error recovery case of nvme rdma.
>
> Another example when nvme_tcp_setup_ctrl fails, we don't call nvme_remove_
> admin_tag_set() but we explicitly call the nvme_stop_keep_alive().
>
> Third example while resetting tcp/rdam ctrl, we don't call nvme_remove_admin_
> tag_set() but we explicitly call the nvme_stop_keep_alive().
If nvme_remove_admin_tag_set() isn't called, do we really need to call
nvme_stop_keep_alive() in the failure path?
>
> In your first point above, you mentioned that calling nvme_stop_keep_alive()
> from nvme_uninit_ctrl() isn't correct, would you please explain why is it so,
> considering fact that we call nvme_start_keep_alive() from nvme_init_ctrl_
> finish()?
nvme_uninit_ctrl() is usually run after nvme_remove_admin_tag_set(), when the
admin queue is destroyed and tagset is freed.
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
2024-10-30 12:51 ` Ming Lei
@ 2024-10-31 10:10 ` Nilay Shroff
0 siblings, 0 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-10-31 10:10 UTC (permalink / raw)
To: Ming Lei
Cc: linux-nvme, kbusch, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 10/30/24 18:21, Ming Lei wrote:
> On Wed, Oct 30, 2024 at 04:08:16PM +0530, Nilay Shroff wrote:
>>
>>
>> On 10/30/24 07:50, Ming Lei wrote:
>>> On Tue, Oct 29, 2024 at 06:10:00PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 10/29/24 12:53, Ming Lei wrote:
>>>>> On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
>>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>>> This may lead to a race between the fabric controller admin queue
>>>>>> destroy code path (invoked while shutting down the controller) and the
>>>>>> hw/hctx queue dispatcher called from the nvme keep-alive async request
>>>>>> queuing operation. This race could lead to the kernel crash shown below:
>>>>>>
>>>>>> Call Trace:
>>>>>> autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>> __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>> blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>> blk_mq_run_hw_queue+0x140/0x220
>>>>>> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>> process_one_work+0x200/0x4e0
>>>>>> worker_thread+0x340/0x504
>>>>>> kthread+0x138/0x140
>>>>>> start_kernel_thread+0x14/0x18
>>>>>>
>>>>>> While shutting down the fabric controller, if an nvme keep-alive request
>>>>>> sneaks in, it would be flushed off. The nvme_keep_alive_end_io function
>>>>>> is then invoked to handle the end of the keep-alive operation; it
>>>>>> decrements admin->q_usage_counter and, assuming this is the last/only
>>>>>> request in the admin queue, admin->q_usage_counter drops to zero.
>>>>>> If that happens, the blk-mq destroy queue operation
>>>>>> (blk_mq_destroy_queue()), which could potentially be running
>>>>>> simultaneously on another cpu (as this is the controller shutdown code
>>>>>> path), would make forward progress and delete the admin queue. From this
>>>>>> point onward we are not supposed to access the admin queue resources.
>>>>>> However, the issue here is that the nvme keep-alive thread running the
>>>>>> hw/hctx queue dispatch operation hasn't yet finished its work, so it
>>>>>> could still access the admin queue resources after the admin queue has
>>>>>> already been deleted, and that causes the above crash.
>>>>>>
>>>>>> The above kernel crash is a regression caused by the changes implemented
>>>>>> in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>>>>>> nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
>>>>>> beginning of the controller shutdown code path so that it wouldn't
>>>>>> sneak in during the shutdown operation. However, we removed the
>>>>>> keep-alive stop operation from the beginning of the controller shutdown
>>>>>> code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into
>>>>>> nvme_uninit_ctrl()"), and that created the possibility of keep-alive
>>>>>> sneaking in, interfering with the shutdown operation, and causing the
>>>>>> observed kernel crash. So to fix this crash, we're adding back the
>>>>>> keep-alive stop operation at the very beginning of the fabric controller
>>>>>> shutdown code path, so that the actual controller shutdown operation
>>>>>> only begins after it's ensured that the keep-alive operation is not
>>>>>> in-flight and can't be scheduled in the future.
>>>>>>
>>>>>> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
>>>>>> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>> ---
>>>>>> drivers/nvme/host/core.c | 5 +++++
>>>>>> 1 file changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index 5016f69e9a15..865c00ea19e3 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>>>>>> {
>>>>>> nvme_mpath_stop(ctrl);
>>>>>> nvme_auth_stop(ctrl);
>>>>>> + /*
>>>>>> + * the transport driver may be terminating the admin tagset a little
>>>>>> + * later on, so we cannot have the keep-alive work running
>>>>>> + */
>>>>>> + nvme_stop_keep_alive(ctrl);
>>>>>> nvme_stop_failfast_work(ctrl);
>>>>>> flush_work(&ctrl->async_event_work);
>>>>>> cancel_work_sync(&ctrl->fw_act_work);
>>>>>
>>>>> The change looks fine.
>>>>>
>>>>> IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to the
>>>>> entry of nvme_remove_admin_tag_set(); then this one in nvme_stop_ctrl()
>>>>> can be saved?
>>>>>
>>>> Yes, that should work. However, IMO stopping keep-alive at the very
>>>> beginning of the shutdown operation makes sense, because delaying it would
>>>> not be useful anyway once we start the controller shutdown. It may sneak in
>>>> unnecessarily while we shut down the controller, and later we would have to
>>>> flush it off.
>>>>
>>>> And yes, as you mentioned, in this case we would save one call site, but
>>>> looking at the code we already have a few other call sites where we call
>>>> nvme_stop_keep_alive().
>>>
>>> If it works, I'd suggest moving nvme_stop_keep_alive() from
>>> nvme_uninit_ctrl() into nvme_remove_admin_tag_set() because:
>>>
>>> 1) it isn't correct to do it in nvme_uninit_ctrl()
>>>
>>> 2) stopping keep-alive in nvme_remove_admin_tag_set() covers everything
>>>
>>> 3) most of the nvme_stop_keep_alive() calls in the failure paths can be removed
>>>
>>
>> Moving nvme_stop_keep_alive() from nvme_uninit_ctrl() to
>> nvme_remove_admin_tag_set() works; I tested it with blktests nvme/037.
>> However, I am afraid we may not be able to remove most of the
>> nvme_stop_keep_alive() calls from the failure/recovery code paths, because
>> most of those paths don't invoke nvme_remove_admin_tag_set().
>>
>> For instance, nvme_tcp_error_recovery_work() does not invoke
>> nvme_remove_admin_tag_set() but explicitly calls nvme_stop_keep_alive().
>> The same is true for the nvme-rdma error recovery path.
>>
>> Another example: when nvme_tcp_setup_ctrl() fails, we don't call
>> nvme_remove_admin_tag_set() but we explicitly call nvme_stop_keep_alive().
>>
>> A third example: while resetting a tcp/rdma ctrl, we don't call
>> nvme_remove_admin_tag_set() but we explicitly call nvme_stop_keep_alive().
>
> If nvme_remove_admin_tag_set() isn't called, do we really need to call
> nvme_stop_keep_alive() in the failure path?
>
During tcp/rdma error recovery we first stop keep-alive and restart it later,
once the connection to the fabric controller is restored.
IMO that makes sense: if the connection to the fabric controller is down, do
we really need keep-alive to be running? I don't think so. So in the error
recovery path we first stop keep-alive, and once the connection to the fabric
controller is restored we restart it.
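For reference, nvme_stop_keep_alive() is essentially just a
cancel_delayed_work_sync() on the keep-alive work item (sketched below from
memory of drivers/nvme/host/core.c; check the tree for the exact body), so
once it returns no keep-alive work is pending or running:

	void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
	{
		/* cancel a pending ka_work and wait for a running one to finish */
		cancel_delayed_work_sync(&ctrl->ka_work);
	}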
>>
>> In your first point above, you mentioned that calling nvme_stop_keep_alive()
>> from nvme_uninit_ctrl() isn't correct; would you please explain why that is,
>> considering the fact that we call nvme_start_keep_alive() from
>> nvme_init_ctrl_finish()?
>
> nvme_uninit_ctrl() is usually run after nvme_remove_admin_tag_set(), when the
> admin queue is destroyed and tagset is freed.
>
OK understood. I will update the current patch and move nvme_stop_keep_alive()
from nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). So with this change
we could save at least one call site of nvme_stop_keep_alive().
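A rough sketch of what that could look like (the existing body of
nvme_remove_admin_tag_set() is paraphrased here from drivers/nvme/host/core.c;
this is only an illustration, not the respun patch):

	void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
	{
		/*
		 * The admin queue is about to be destroyed and the admin tagset
		 * freed, so keep-alive must not be running or able to requeue
		 * itself past this point.
		 */
		nvme_stop_keep_alive(ctrl);
		blk_mq_destroy_queue(ctrl->admin_q);
		blk_put_queue(ctrl->admin_q);
		if (ctrl->ops->flags & NVME_F_FABRICS) {
			blk_mq_destroy_queue(ctrl->fabrics_q);
			blk_put_queue(ctrl->fabrics_q);
		}
		blk_mq_free_tag_set(ctrl->admin_tagset);
	}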
Thanks,
--Nilay
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-10-30 6:07 ` Nilay Shroff
@ 2024-11-01 19:13 ` Keith Busch
2024-11-03 17:54 ` Nilay Shroff
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-11-01 19:13 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On Wed, Oct 30, 2024 at 11:37:15AM +0530, Nilay Shroff wrote:
> On 10/29/24 20:31, Keith Busch wrote:
> >> @@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> >> blk_status_t status)
> >> {
> >> struct nvme_ctrl *ctrl = rq->end_io_data;
> >> - unsigned long flags;
> >> - bool startka = false;
> >> unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
> >> unsigned long delay = nvme_keep_alive_work_period(ctrl);
> >> + enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
> >>
> >> /*
> >> * Subtract off the keepalive RTT so nvme_keep_alive_work runs
> >> @@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> >>
> >> ctrl->ka_last_check_time = jiffies;
> >> ctrl->comp_seen = false;
> >> - spin_lock_irqsave(&ctrl->lock, flags);
> >> - if (ctrl->state == NVME_CTRL_LIVE ||
> >> - ctrl->state == NVME_CTRL_CONNECTING)
> >> - startka = true;
> >> - spin_unlock_irqrestore(&ctrl->lock, flags);
> >> - if (startka)
> >> + if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
> >> queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
> >> +
> >> return RQ_END_IO_NONE;
> >> }
> >
> > This restores what you had in the reverted patch, which was a
> > significant portion of the diff stat. If we're keeping this chunk, maybe
> > a full revert isn't necessary, and we can squash the series to just one
> > fixing commit.
> >
> This doesn't exactly restore what was reverted in the patch. If you
> look closely at the reverted patch, you would find that we reverted
> the nvme_keep_alive_finish() function, which is not exactly the same as
> the nvme_keep_alive_end_io() function. We renamed nvme_keep_alive_end_io()
> to nvme_keep_alive_finish() for the upstream patch.
>
> So the revert patch restores the nvme_keep_alive_end_io() function and also
> restores the nvme_keep_alive_work changes. The current patch in this
> patchset then actually works on the nvme_keep_alive_end_io() function.
Not the whole thing. I'm just talking about the parts I've highlighted
here. Squashing this series doesn't show any changes to the state check.
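For context, the nvme_ctrl_state() helper the patch switches to is just a
READ_ONCE() wrapper (sketched from drivers/nvme/host/nvme.h), which is why the
ctrl->lock around the state check can go away: the completion path only needs
a consistent snapshot of the state, not a lock-protected read.

	static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
	{
		return READ_ONCE(ctrl->state);
	}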
* Re: [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function
2024-11-01 19:13 ` Keith Busch
@ 2024-11-03 17:54 ` Nilay Shroff
0 siblings, 0 replies; 21+ messages in thread
From: Nilay Shroff @ 2024-11-03 17:54 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, dlemoal, gjoyce
On 11/2/24 00:43, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 11:37:15AM +0530, Nilay Shroff wrote:
>> On 10/29/24 20:31, Keith Busch wrote:
>>>> @@ -1296,10 +1296,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>>>> blk_status_t status)
>>>> {
>>>> struct nvme_ctrl *ctrl = rq->end_io_data;
>>>> - unsigned long flags;
>>>> - bool startka = false;
>>>> unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
>>>> unsigned long delay = nvme_keep_alive_work_period(ctrl);
>>>> + enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>>>>
>>>> /*
>>>> * Subtract off the keepalive RTT so nvme_keep_alive_work runs
>>>> @@ -1324,13 +1323,9 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>>>>
>>>> ctrl->ka_last_check_time = jiffies;
>>>> ctrl->comp_seen = false;
>>>> - spin_lock_irqsave(&ctrl->lock, flags);
>>>> - if (ctrl->state == NVME_CTRL_LIVE ||
>>>> - ctrl->state == NVME_CTRL_CONNECTING)
>>>> - startka = true;
>>>> - spin_unlock_irqrestore(&ctrl->lock, flags);
>>>> - if (startka)
>>>> + if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
>>>> queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
>>>> +
>>>> return RQ_END_IO_NONE;
>>>> }
>>>
>>> This restores what you had in the reverted patch, which was a
>>> significant portion of the diff stat. If we're keeping this chunk, maybe
>>> a full revert isn't necessary, and we can squash the series to just one
>>> fixing commit.
>>>
>> This doesn't exactly restore what was reverted in the patch. If you
>> look closely at the reverted patch, you would find that we reverted
>> the nvme_keep_alive_finish() function, which is not exactly the same as
>> the nvme_keep_alive_end_io() function. We renamed nvme_keep_alive_end_io()
>> to nvme_keep_alive_finish() for the upstream patch.
>>
>> So the revert patch restores the nvme_keep_alive_end_io() function and also
>> restores the nvme_keep_alive_work changes. The current patch in this
>> patchset then actually works on the nvme_keep_alive_end_io() function.
>
> Not the whole thing. I'm just talking about the parts I've highlighted
> here. Squashing this series doesn't show any changes to the state check.
>
Ok got it! I will squash these parts in the next revision of the patchset.
Thanks,
--Nilay
End of thread (last update: 2024-11-03 17:54 UTC).
Thread overview: 21+ messages
2024-10-27 17:02 [PATCH 0/3] nvme: fix system fault observed while shutting down controller Nilay Shroff
2024-10-27 17:02 ` [PATCH 1/3] Revert "nvme: make keep-alive synchronous operation" Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 6:48 ` Ming Lei
2024-10-29 12:46 ` Nilay Shroff
2024-10-27 17:02 ` [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller Nilay Shroff
2024-10-27 22:05 ` Sagi Grimberg
2024-10-29 7:23 ` Ming Lei
2024-10-29 12:40 ` Nilay Shroff
2024-10-30 2:20 ` Ming Lei
2024-10-30 10:38 ` Nilay Shroff
2024-10-30 12:51 ` Ming Lei
2024-10-31 10:10 ` Nilay Shroff
2024-10-27 17:02 ` [PATCH 3/3] nvme: use helper nvme_ctrl_state in nvme_keep_alive_end_io function Nilay Shroff
2024-10-27 22:07 ` Sagi Grimberg
2024-10-28 4:43 ` Nilay Shroff
2024-10-29 14:58 ` Keith Busch
2024-10-29 15:01 ` Keith Busch
2024-10-30 6:07 ` Nilay Shroff
2024-11-01 19:13 ` Keith Busch
2024-11-03 17:54 ` Nilay Shroff