Date: Tue, 3 Feb 2026 13:29:01 -0800
From: Mohamed Khalfella
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
Message-ID: <20260203212901.GH3729-mkhalfella@purestorage.com>
References: <20260130223531.2478849-1-mkhalfella@purestorage.com>
 <20260130223531.2478849-13-mkhalfella@purestorage.com>
 <692717f0-d0c7-4674-8e65-f8bae8dad4fd@suse.de>
In-Reply-To: <692717f0-d0c7-4674-8e65-f8bae8dad4fd@suse.de>

On Tue 2026-02-03 06:40:28 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> > in CONNECTING state results in deadlock reported in link below. Update
> > nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
> >
> > Previous to this change if controller was LIVE error recovery resets
> > the controller and this does not match nvme-tcp and nvme-rdma.
> > Decouple
> > error recovery from controller reset to match other fabric transports.
> >
> > Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> > Signed-off-by: Mohamed Khalfella
> > ---
> >  drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------
> >  1 file changed, 41 insertions(+), 53 deletions(-)
> >
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index 6948de3f438a..f8f6071b78ed 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
> >  static struct device *fc_udev_device;
> >
> >  static void nvme_fc_complete_rq(struct request *rq);
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +		char *errmsg);
> >
> >  /* *********************** FC-NVME Port Management ************************ */
> >
> > @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> >  			"Reconnect", ctrl->cnum);
> >
> >  	set_bit(ASSOC_FAILED, &ctrl->flags);
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
> >  }
> >
> >  /**
> > @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> >  static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> >  static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> >
> > -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> > +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
> >
> >  static void
> >  __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> > @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
> >  	 * for the association have been ABTS'd by
> >  	 * nvme_fc_delete_association().
> >  	 */
> > -
> > -	/* fail the association */
> > -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> > +	nvme_fc_start_ioerr_recovery(ctrl,
> > +			"Disconnect Association LS received");
> >
> >  	/* release the reference taken by nvme_fc_match_disconn_ls() */
> >  	nvme_fc_ctrl_put(ctrl);
> > @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> >  	struct nvme_fc_ctrl *ctrl =
> >  		container_of(work, struct nvme_fc_ctrl, ioerr_work);
> >
> > -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> > +	nvme_fc_error_recovery(ctrl);
> >  }
> >
> >  /*
> > @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
> >  }
> >  EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
> >
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +		char *errmsg)
> > +{
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> > +		return;
> > +
> > +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > +		ctrl->cnum, errmsg);
> > +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +}
> > +
> >  static void
> >  nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >  {
> > @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >  	nvme_fc_complete_rq(rq);
> >
> >  check_error:
> > -	if (terminate_assoc &&
> > -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> > -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +	if (terminate_assoc)
> > +		nvme_fc_start_ioerr_recovery(ctrl, "io error");
> >  }
> >
> >  static int
> > @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >  	nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >  }
> >
> > -static void
> > -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> > -{
> > -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> > -
> > -	/*
> > -	 * if an error (io timeout, etc) while (re)connecting, the remote
> > -	 * port requested terminating of the association (disconnect_ls)
> > -	 * or an error (timeout or abort) occurred on an io while creating
> > -	 * the controller. Abort any ios on the association and let the
> > -	 * create_association error path resolve things.
> > -	 */
> > -	if (state == NVME_CTRL_CONNECTING) {
> > -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> > -		dev_warn(ctrl->ctrl.device,
> > -			"NVME-FC{%d}: transport error during (re)connect\n",
> > -			ctrl->cnum);
> > -		return;
> > -	}
> > -
> > -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> > -	if (state != NVME_CTRL_LIVE)
> > -		return;
> > -
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: transport association event: %s\n",
> > -		ctrl->cnum, errmsg);
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> > -
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > -}
> > -
> >  static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >  {
> >  	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> > @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >  	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >  	struct nvme_command *sqe = &cmdiu->sqe;
> >
> > -	/*
> > -	 * Attempt to abort the offending command. Command completion
> > -	 * will detect the aborted io and will fail the connection.
> > -	 */
> >  	dev_info(ctrl->ctrl.device,
> >  		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >  		"x%08x/x%08x\n",
> >  		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >  		nvme_fabrics_opcode_str(qnum, sqe),
> >  		sqe->common.cdw10, sqe->common.cdw11);
> > -	if (__nvme_fc_abort_op(ctrl, op))
> > -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >
> > -	/*
> > -	 * the io abort has been initiated. Have the reset timer
> > -	 * restarted and the abort completion will complete the io
> > -	 * shortly. Avoids a synchronous wait while the abort finishes.
> > -	 */
> > +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> >  	return BLK_EH_RESET_TIMER;
> >  }
> >
> > @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >  	}
> >  }
> >
> > +static void
> > +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> > +{
> > +	nvme_stop_keep_alive(&ctrl->ctrl);
> > +	nvme_stop_ctrl(&ctrl->ctrl);
> > +
> > +	/* will block while waiting for io to terminate */
> > +	nvme_fc_delete_association(ctrl);
> > +
> > +	/* Do not reconnect if controller is being deleted */
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> > +		return;
> > +
> > +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> > +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> > +		return;
> > +	}
> > +
> > +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> > +}
> >
> >  static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> >  	.name	= "fc",
>
> I really don't get it. Why do you need to do additional steps here, when
> all you do is split an existing function in half?

Can you help me by pointing out the part that is additional? Is it
nvme_stop_keep_alive()? If yes, it matches what the other transports do.
I am okay with removing it, in which case we assume the keep-alive work
stops on its own when it hits an error, as it does today.

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                    Kernel Storage Architect
> hare@suse.de                                  +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich