From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Daniel Wagner <wagi@kernel.org>, Hannes Reinecke <hare@suse.de>,
	Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	james.smart@broadcom.com, linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 6.12 02/19] nvme-fc: do not ignore connectivity loss during connecting
Date: Mon, 10 Feb 2025 20:30:30 -0500
Message-ID: <20250211013047.4096767-2-sashal@kernel.org>
In-Reply-To: <20250211013047.4096767-1-sashal@kernel.org>

From: Daniel Wagner <wagi@kernel.org>

[ Upstream commit ee59e3820ca92a9f4307ae23dfc7229dc8b8d400 ]

When a connectivity loss occurs while nvme_fc_create_association is
being executed, it's possible that the ctrl ends up stuck in the LIVE
state:

  1) nvme nvme10: NVME-FC{10}: create association : ...
  2) nvme nvme10: NVME-FC{10}: controller connectivity lost.
                  Awaiting Reconnect
     nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
  3) nvme nvme10: Could not set queue count (880)
     nvme nvme10: Failed to configure AEN (cfg 900)
  4) nvme nvme10: NVME-FC{10}: controller connect complete
  5) nvme nvme10: failed nvme_keep_alive_end_io error=4

A connection attempt starts 1) and the ctrl is in state CONNECTING.
Shortly after, the LLDD detects a connectivity loss event and calls
nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
state, this event is ignored.

nvme_fc_create_association continues to run in parallel and tries to
communicate with the controller, and these commands fail. These errors
are filtered out, though: e.g. in 3) setting the number of I/O queues
fails, which leads to an early exit in nvme_fc_create_io_queues. Because
the number of I/O queues is 0 at this point, there is nothing left in
nvme_fc_create_association which could detect the connection drop.
Thus the ctrl enters LIVE state 4).

Eventually the keep alive handler times out 5), but because no error
recovery is started in response, the ctrl stays in LIVE state.

There is already the ASSOC_FAILED flag to track connectivity loss
events, but this bit is set too late in the recovery code path. Move
setting the bit into the connectivity loss event handler and synchronize
it with the state change. This ensures that the ASSOC_FAILED flag is
seen by nvme_fc_create_association and the ctrl does not enter the LIVE
state after a connectivity loss event. If the connectivity loss event
happens after we entered the LIVE state, the normal error recovery path
is executed.
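
The synchronization described above can be illustrated with a small
user-space model. This is only a sketch, not the driver code: the
model_* names are hypothetical, a pthread mutex stands in for the
ctrl->lock spinlock, and a bool stands in for the ASSOC_FAILED bit. It
shows why setting the flag and checking it under the same lock keeps
the ctrl from going LIVE after a loss was reported.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the controller states used in the patch. */
enum model_state { MODEL_CONNECTING, MODEL_LIVE };

/* Hypothetical model of the fields involved; not struct nvme_fc_ctrl. */
struct model_ctrl {
	pthread_mutex_t lock;   /* stands in for ctrl->lock */
	bool assoc_failed;      /* stands in for the ASSOC_FAILED bit */
	enum model_state state;
};

static void model_init(struct model_ctrl *c)
{
	assert(c != NULL);
	pthread_mutex_init(&c->lock, NULL);
	c->assoc_failed = false;
	c->state = MODEL_CONNECTING;
}

/*
 * Loss handler: record the failure and sample the state under the
 * lock, so the connect path cannot miss the flag.
 */
static enum model_state model_connectivity_loss(struct model_ctrl *c)
{
	enum model_state state;

	pthread_mutex_lock(&c->lock);
	c->assoc_failed = true;
	state = c->state;
	pthread_mutex_unlock(&c->lock);

	return state;
}

/*
 * Last step of the connect path: switch to LIVE only if no loss was
 * recorded; otherwise report -EIO and leave the state unchanged.
 */
static int model_finish_connect(struct model_ctrl *c)
{
	int ret = 0;

	pthread_mutex_lock(&c->lock);
	if (!c->assoc_failed)
		c->state = MODEL_LIVE;
	else
		ret = -EIO;
	pthread_mutex_unlock(&c->lock);

	return ret;
}
```

In this model, a loss reported before model_finish_connect() makes the
connect attempt fail with -EIO, while a loss reported afterwards
observes MODEL_LIVE, which corresponds to the case where the normal
error recovery path takes over.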

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/fc.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index d45ab530ff9b7..b211a29b13f25 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -782,11 +782,19 @@ nvme_fc_abort_lsops(struct nvme_fc_rport *rport)
 static void
 nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 {
+	enum nvme_ctrl_state state;
+	unsigned long flags;
+
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: controller connectivity lost. Awaiting "
 		"Reconnect", ctrl->cnum);
 
-	switch (nvme_ctrl_state(&ctrl->ctrl)) {
+	spin_lock_irqsave(&ctrl->lock, flags);
+	set_bit(ASSOC_FAILED, &ctrl->flags);
+	state = nvme_ctrl_state(&ctrl->ctrl);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+
+	switch (state) {
 	case NVME_CTRL_NEW:
 	case NVME_CTRL_LIVE:
 		/*
@@ -2543,7 +2551,6 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
 	 */
 	if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
 		__nvme_fc_abort_outstanding_ios(ctrl, true);
-		set_bit(ASSOC_FAILED, &ctrl->flags);
 		dev_warn(ctrl->ctrl.device,
 			"NVME-FC{%d}: transport error during (re)connect\n",
 			ctrl->cnum);
@@ -3168,12 +3175,18 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 		else
 			ret = nvme_fc_recreate_io_queues(ctrl);
 	}
-	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
-		ret = -EIO;
 	if (ret)
 		goto out_term_aen_ops;
 
-	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
+	spin_lock_irqsave(&ctrl->lock, flags);
+	if (!test_bit(ASSOC_FAILED, &ctrl->flags))
+		changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
+	else
+		ret = -EIO;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+
+	if (ret)
+		goto out_term_aen_ops;
 
 	ctrl->ctrl.nr_reconnects = 0;
 
-- 
2.39.5


