From: Hannes Reinecke <hare@suse.de>
To: Daniel Wagner <wagi@kernel.org>,
James Smart <james.smart@broadcom.com>,
Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
Sagi Grimberg <sagi@grimberg.me>,
Paul Ely <paul.ely@broadcom.com>
Cc: linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 3/3] nvme-fc: do not ignore connectivity loss during connecting
Date: Mon, 20 Jan 2025 14:45:46 +0100
Message-ID: <cab2575c-037d-4d9d-896c-3bd2c64c9a0b@suse.de>
In-Reply-To: <20250109-nvme-fc-handle-com-lost-v4-3-fe5cae17b492@kernel.org>

On 1/9/25 14:30, Daniel Wagner wrote:
> When a connectivity loss occurs while nvme_fc_create_association is
> being executed, it's possible that the ctrl ends up stuck in the LIVE
> state:
>
> 1) nvme nvme10: NVME-FC{10}: create association : ...
> 2) nvme nvme10: NVME-FC{10}: controller connectivity lost.
> Awaiting Reconnect
> nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
> 3) nvme nvme10: Could not set queue count (880)
> nvme nvme10: Failed to configure AEN (cfg 900)
> 4) nvme nvme10: NVME-FC{10}: controller connect complete
> 5) nvme nvme10: failed nvme_keep_alive_end_io error=4
>
> A connection attempt starts 1) and the ctrl is in state CONNECTING.
> Shortly after, the LLDD detects a connection loss event and calls
> nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
> state, this event is ignored.
>
> nvme_fc_create_association continues to run in parallel and tries to
> communicate with the controller, and these commands fail. However,
> these errors are filtered out: e.g. in 3) setting the number of I/O
> queues fails, which leads to an early exit in nvme_fc_create_io_queues.
> Because the number of I/O queues is 0 at this point, there is nothing
> left in nvme_fc_create_association which could detect the connection
> drop. Thus the ctrl enters LIVE state 4).
>
> Eventually the keep-alive handler times out 5), but because nothing
> acts on that failure, the ctrl stays in LIVE state.
>
> There is already the ASSOC_FAILED flag to track connectivity loss
> events, but this bit is set too late in the recovery code path. Move
> setting it into the connectivity loss event handler and synchronize it
> with the state change. This ensures that the ASSOC_FAILED flag is seen
> by nvme_fc_create_io_queues and the ctrl does not enter the LIVE state
> after a connectivity loss event. If the connectivity loss event happens
> after we entered the LIVE state, the normal error recovery path is
> executed.
>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
> drivers/nvme/host/fc.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 7409da42b9ee580cdd6fe78c0f93e78c4ad08675..55884d3df6f291cfddb4742e135b54a72f1cfa05 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -781,11 +781,19 @@ nvme_fc_abort_lsops(struct nvme_fc_rport *rport)
> static void
> nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> {
> + enum nvme_ctrl_state state;
> + unsigned long flags;
> +
> dev_info(ctrl->ctrl.device,
> "NVME-FC{%d}: controller connectivity lost. Awaiting "
> "Reconnect", ctrl->cnum);
>
> - switch (nvme_ctrl_state(&ctrl->ctrl)) {
> + spin_lock_irqsave(&ctrl->lock, flags);
> + set_bit(ASSOC_FAILED, &ctrl->flags);
> + state = nvme_ctrl_state(&ctrl->ctrl);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> +
> + switch (state) {
> case NVME_CTRL_NEW:
> case NVME_CTRL_LIVE:
> /*
> @@ -2542,7 +2550,6 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> */
> if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
> __nvme_fc_abort_outstanding_ios(ctrl, true);
> - set_bit(ASSOC_FAILED, &ctrl->flags);
> dev_warn(ctrl->ctrl.device,
> "NVME-FC{%d}: transport error during (re)connect\n",
> ctrl->cnum);
> @@ -3167,12 +3174,18 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
> else
> ret = nvme_fc_recreate_io_queues(ctrl);
> }
> - if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
> - ret = -EIO;
> if (ret)
> goto out_term_aen_ops;
>
> - changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> + spin_lock_irqsave(&ctrl->lock, flags);
> + if (!test_bit(ASSOC_FAILED, &ctrl->flags))
> + changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> + else
> + ret = -EIO;
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> +
> + if (ret)
> + goto out_term_aen_ops;
>
> ctrl->ctrl.nr_reconnects = 0;
>
>
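For readers following the locking argument, here is a minimal userspace
sketch of the pattern the two hunks above establish: the loss handler
sets the failure flag and samples the state under the same lock that the
connect path holds while it tests the flag and moves to LIVE. This is
not the driver code. The struct, the model_* names and the pthread mutex
(standing in for ctrl->lock and spin_lock_irqsave) are assumptions made
for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the controller states relevant to the race. */
enum model_state { MODEL_NEW, MODEL_CONNECTING, MODEL_LIVE };

struct model_ctrl {
	pthread_mutex_t lock;	/* stands in for ctrl->lock */
	bool assoc_failed;	/* stands in for the ASSOC_FAILED bit */
	enum model_state state;
};

static void model_ctrl_init(struct model_ctrl *c)
{
	assert(c != NULL);
	pthread_mutex_init(&c->lock, NULL);
	c->assoc_failed = false;
	c->state = MODEL_CONNECTING;
}

/*
 * Loss side: record the failure and sample the state in one critical
 * section, so the connect side cannot slip in between the two steps.
 */
static enum model_state model_connectivity_loss(struct model_ctrl *c)
{
	enum model_state state;

	pthread_mutex_lock(&c->lock);
	c->assoc_failed = true;
	state = c->state;
	pthread_mutex_unlock(&c->lock);

	return state;
}

/*
 * Connect side: enter LIVE only if no loss was recorded. Test and
 * transition happen under the same lock as the flag update above.
 */
static int model_finish_connect(struct model_ctrl *c)
{
	int ret = 0;

	pthread_mutex_lock(&c->lock);
	if (!c->assoc_failed)
		c->state = MODEL_LIVE;
	else
		ret = -EIO;
	pthread_mutex_unlock(&c->lock);

	return ret;
}
```

With this ordering only two outcomes are possible. Either the loss is
recorded first and model_finish_connect() returns -EIO without leaving
CONNECTING, or the transition to LIVE wins and the loss handler samples
LIVE, which in the driver would lead into the normal error recovery
path.
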
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich