public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed
* [PATCH 0/2] nvme-fc: enhance transport error handling when create associations
@ 2023-07-07 21:21 Michael Liang
  2023-07-07 21:21 ` [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association Michael Liang
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Michael Liang @ 2023-07-07 21:21 UTC (permalink / raw)
  To: linux-nvme, jsmart2021; +Cc: randyj, csander, ushankar, psajeepa, Michael Liang

Currently when there is some transport error happens during creating
nvme-fc association, nvme_fc_create_association() could return 0 status code
which implies the association creation was done successfully while it's
actually not. The consequence is the nvme controller could be stuck in
connecting state infinitely. Also there is a potential race condition
between transport io error recovery and the last ASSOC_FAILED check in
nvme_fc_create_association(). We may end up ignoring a transport error during
connecting creation.

Fix these two issues by:
1. Return non-zero status code(-EIO) when needed, so re-connecting
or deleting controller will be triggered properly;
2. Protect accessing to nvme-fc controller ASSOC_FAILED flag and nvme controller
state transition under nvme-fc controller's spin lock;

Michael Liang (2):
  nvme-fc: return non-zero status code when fails to create association
  nvme-fc: fix race between error recovery and creating association

 drivers/nvme/host/fc.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association
  2023-07-07 21:21 [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Michael Liang
@ 2023-07-07 21:21 ` Michael Liang
  2023-07-11 19:48   ` James Smart
  2023-07-07 21:21 ` [PATCH 2/2] nvme-fc: fix race between error recovery and creating association Michael Liang
  2023-07-12 16:30 ` [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Keith Busch
  2 siblings, 1 reply; 7+ messages in thread
From: Michael Liang @ 2023-07-07 21:21 UTC (permalink / raw)
  To: linux-nvme, jsmart2021; +Cc: randyj, csander, ushankar, psajeepa, Michael Liang

Return non-zero status code(-EIO) when needed, so re-connecting or
deleting controller will be triggered properly.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
---
 drivers/nvme/host/fc.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 691f2df574ce..ad9336343e01 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2551,6 +2551,9 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
 	if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
 		__nvme_fc_abort_outstanding_ios(ctrl, true);
 		set_bit(ASSOC_FAILED, &ctrl->flags);
+		dev_warn(ctrl->ctrl.device,
+			"NVME-FC{%d}: transport error during (re)connect\n",
+			ctrl->cnum);
 		return;
 	}
 
@@ -3110,7 +3113,9 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 	 */
 
 	ret = nvme_enable_ctrl(&ctrl->ctrl);
-	if (ret || test_bit(ASSOC_FAILED, &ctrl->flags))
+	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
+		ret = -EIO;
+	if (ret)
 		goto out_disconnect_admin_queue;
 
 	ctrl->ctrl.max_segments = ctrl->lport->ops->max_sgl_segments;
@@ -3120,7 +3125,9 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 	nvme_unquiesce_admin_queue(&ctrl->ctrl);
 
 	ret = nvme_init_ctrl_finish(&ctrl->ctrl, false);
-	if (ret || test_bit(ASSOC_FAILED, &ctrl->flags))
+	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
+		ret = -EIO;
+	if (ret)
 		goto out_disconnect_admin_queue;
 
 	/* sanity checks */
@@ -3165,7 +3172,9 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 		else
 			ret = nvme_fc_recreate_io_queues(ctrl);
 	}
-	if (ret || test_bit(ASSOC_FAILED, &ctrl->flags))
+	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
+		ret = -EIO;
+	if (ret)
 		goto out_term_aen_ops;
 
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -3180,6 +3189,9 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 out_term_aen_ops:
 	nvme_fc_term_aen_ops(ctrl);
 out_disconnect_admin_queue:
+	dev_warn(ctrl->ctrl.device,
+		"NVME-FC{%d}: create_assoc failed, assoc_id %llx ret %d\n",
+		ctrl->cnum, ctrl->association_id, ret);
 	/* send a Disconnect(association) LS to fc-nvme target */
 	nvme_fc_xmt_disconnect_assoc(ctrl);
 	spin_lock_irqsave(&ctrl->lock, flags);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] nvme-fc: fix race between error recovery and creating association
  2023-07-07 21:21 [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Michael Liang
  2023-07-07 21:21 ` [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association Michael Liang
@ 2023-07-07 21:21 ` Michael Liang
  2023-07-11 19:48   ` James Smart
  2023-11-07  0:02   ` Ming Lei
  2023-07-12 16:30 ` [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Keith Busch
  2 siblings, 2 replies; 7+ messages in thread
From: Michael Liang @ 2023-07-07 21:21 UTC (permalink / raw)
  To: linux-nvme, jsmart2021; +Cc: randyj, csander, ushankar, psajeepa, Michael Liang

There is a small race window between nvme-fc association creation and error
recovery. Fix this race condition by protecting accessing to controller
state and ASSOC_FAILED flag under nvme-fc controller lock.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
---
 drivers/nvme/host/fc.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index ad9336343e01..1cd2bf82319a 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2548,17 +2548,24 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
 	 * the controller.  Abort any ios on the association and let the
 	 * create_association error path resolve things.
 	 */
-	if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
-		__nvme_fc_abort_outstanding_ios(ctrl, true);
+	enum nvme_ctrl_state state;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	state = ctrl->ctrl.state;
+	if (state == NVME_CTRL_CONNECTING) {
 		set_bit(ASSOC_FAILED, &ctrl->flags);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+		__nvme_fc_abort_outstanding_ios(ctrl, true);
 		dev_warn(ctrl->ctrl.device,
 			"NVME-FC{%d}: transport error during (re)connect\n",
 			ctrl->cnum);
 		return;
 	}
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE)
+	if (state != NVME_CTRL_LIVE)
 		return;
 
 	dev_warn(ctrl->ctrl.device,
@@ -3172,12 +3179,16 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 		else
 			ret = nvme_fc_recreate_io_queues(ctrl);
 	}
+
+	spin_lock_irqsave(&ctrl->lock, flags);
 	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
 		ret = -EIO;
-	if (ret)
+	if (ret) {
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		goto out_term_aen_ops;
-
+	}
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	ctrl->ctrl.nr_reconnects = 0;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association
  2023-07-07 21:21 ` [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association Michael Liang
@ 2023-07-11 19:48   ` James Smart
  0 siblings, 0 replies; 7+ messages in thread
From: James Smart @ 2023-07-11 19:48 UTC (permalink / raw)
  To: Michael Liang, linux-nvme; +Cc: randyj, csander, ushankar, psajeepa

On 7/7/2023 2:21 PM, Michael Liang wrote:
> Return non-zero status code(-EIO) when needed, so re-connecting or
> deleting controller will be triggered properly.
> 
> Signed-off-by: Michael Liang <mliang@purestorage.com>
> Reviewed-by: Caleb Sander <csander@purestorage.com>
> ---
>   drivers/nvme/host/fc.c | 18 +++++++++++++++---
>   1 file changed, 15 insertions(+), 3 deletions(-)
> 

This patch looks good.  Thanks

Reviewed-by: James Smart <jsmart2021@gmail.com>

-- james




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] nvme-fc: fix race between error recovery and creating association
  2023-07-07 21:21 ` [PATCH 2/2] nvme-fc: fix race between error recovery and creating association Michael Liang
@ 2023-07-11 19:48   ` James Smart
  2023-11-07  0:02   ` Ming Lei
  1 sibling, 0 replies; 7+ messages in thread
From: James Smart @ 2023-07-11 19:48 UTC (permalink / raw)
  To: Michael Liang, linux-nvme; +Cc: randyj, csander, ushankar, psajeepa

On 7/7/2023 2:21 PM, Michael Liang wrote:
> There is a small race window between nvme-fc association creation and error
> recovery. Fix this race condition by protecting accessing to controller
> state and ASSOC_FAILED flag under nvme-fc controller lock.
> 
> Signed-off-by: Michael Liang <mliang@purestorage.com>
> Reviewed-by: Caleb Sander <csander@purestorage.com>
> ---
>   drivers/nvme/host/fc.c | 21 ++++++++++++++++-----
>   1 file changed, 16 insertions(+), 5 deletions(-)

This patch looks good.  Thanks

Reviewed-by: James Smart <jsmart2021@gmail.com>

-- james




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 0/2] nvme-fc: enhance transport error handling when create associations
  2023-07-07 21:21 [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Michael Liang
  2023-07-07 21:21 ` [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association Michael Liang
  2023-07-07 21:21 ` [PATCH 2/2] nvme-fc: fix race between error recovery and creating association Michael Liang
@ 2023-07-12 16:30 ` Keith Busch
  2 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2023-07-12 16:30 UTC (permalink / raw)
  To: Michael Liang; +Cc: linux-nvme, jsmart2021, randyj, csander, ushankar, psajeepa

Applied, thanks!


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] nvme-fc: fix race between error recovery and creating association
  2023-07-07 21:21 ` [PATCH 2/2] nvme-fc: fix race between error recovery and creating association Michael Liang
  2023-07-11 19:48   ` James Smart
@ 2023-11-07  0:02   ` Ming Lei
  1 sibling, 0 replies; 7+ messages in thread
From: Ming Lei @ 2023-11-07  0:02 UTC (permalink / raw)
  To: Michael Liang
  Cc: linux-nvme, jsmart2021, randyj, csander, ushankar, psajeepa,
	Ming Lei, Ewan Milne, Marco Patalano

On Sat, Jul 8, 2023 at 5:24 AM Michael Liang <mliang@purestorage.com> wrote:
>
> There is a small race window between nvme-fc association creation and error
> recovery. Fix this race condition by protecting accessing to controller
> state and ASSOC_FAILED flag under nvme-fc controller lock.
>
> Signed-off-by: Michael Liang <mliang@purestorage.com>
> Reviewed-by: Caleb Sander <csander@purestorage.com>
> ---

...

> @@ -3172,12 +3179,16 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>                 else
>                         ret = nvme_fc_recreate_io_queues(ctrl);
>         }
> +
> +       spin_lock_irqsave(&ctrl->lock, flags);
>         if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
>                 ret = -EIO;
> -       if (ret)
> +       if (ret) {
> +               spin_unlock_irqrestore(&ctrl->lock, flags);
>                 goto out_term_aen_ops;
> -
> +       }
>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> +       spin_unlock_irqrestore(&ctrl->lock, flags);

nvme_change_ctrl_state() may sleep in nvme_kick_requeue_lists(),
this patch has caused regression with
"BUG: scheduling while atomic: kworker/u33:5/31604/0x00000002".

Thanks,



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-11-07  0:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-07 21:21 [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Michael Liang
2023-07-07 21:21 ` [PATCH 1/2] nvme-fc: return non-zero status code when fails to create association Michael Liang
2023-07-11 19:48   ` James Smart
2023-07-07 21:21 ` [PATCH 2/2] nvme-fc: fix race between error recovery and creating association Michael Liang
2023-07-11 19:48   ` James Smart
2023-11-07  0:02   ` Ming Lei
2023-07-12 16:30 ` [PATCH 0/2] nvme-fc: enhance transport error handling when create associations Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox