From: Mohamed Khalfella <mkhalfella@purestorage.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Justin Tee <justin.tee@broadcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@gmail.com>,
Paul Ely <paul.ely@broadcom.com>,
Chaitanya Kulkarni <kch@nvidia.com>, Jens Axboe <axboe@kernel.dk>,
Keith Busch <kbusch@kernel.org>, Sagi Grimberg <sagi@grimberg.me>,
James Smart <jsmart833426@gmail.com>,
Aaron Dailey <adailey@purestorage.com>,
Randy Jennings <randyj@purestorage.com>,
Dhaval Giani <dgiani@purestorage.com>,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 08/15] nvme: Implement cross-controller reset recovery
Date: Tue, 31 Mar 2026 09:47:33 -0700
Message-ID: <20260331164733.GC2861-mkhalfella@purestorage.com>
In-Reply-To: <ee086ee1-a48f-41fb-a7fe-afc6b738cf21@suse.de>
On Mon 2026-03-30 12:50:24 +0200, Hannes Reinecke wrote:
> On 3/28/26 01:43, Mohamed Khalfella wrote:
> > A host that has more than one path connecting to an nvme subsystem
> > typically has an nvme controller associated with every path. This is
> > mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> > path should not be retried immediately on another path because this
> > could lead to data corruption as described in TP4129. TP8028 defines a
> > cross-controller reset mechanism that the host can use to terminate
> > IOs on the failed path via one of the remaining healthy paths. Only
> > after the IOs are terminated, or enough time has passed as defined by
> > TP4129, may inflight IOs be retried on another path. Implement the core
> > cross-controller reset logic shared by the transports.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/constants.c | 1 +
> > drivers/nvme/host/core.c | 145 ++++++++++++++++++++++++++++++++++
> > drivers/nvme/host/nvme.h | 9 +++
> > 3 files changed, 155 insertions(+)
> >
> > diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> > index dc90df9e13a2..f679efd5110e 100644
> > --- a/drivers/nvme/host/constants.c
> > +++ b/drivers/nvme/host/constants.c
> > @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> > [nvme_admin_virtual_mgmt] = "Virtual Management",
> > [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> > [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> > + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> > [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> > [nvme_admin_format_nvm] = "Format NVM",
> > [nvme_admin_security_send] = "Security Send",
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 824a1193bec8..5603ae36444f 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -554,6 +554,150 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> > }
> > EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
> >
> > +static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
> > + u32 min_cntlid)
> > +{
> > + struct nvme_subsystem *subsys = ictrl->subsys;
> > + struct nvme_ctrl *ctrl, *sctrl = NULL;
> > + unsigned long flags;
> > +
> > + mutex_lock(&nvme_subsystems_lock);
> > + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > + if (ctrl->cntlid < min_cntlid)
> > + continue;
> > +
> > + if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
> > + continue;
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + if (ctrl->state != NVME_CTRL_LIVE) {
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > + atomic_inc(&ctrl->ccr_limit);
> > + continue;
> > + }
> > +
> > + /*
> > + * We got a good candidate source controller that is locked and
> > + * LIVE. However, no guarantee ctrl will not be deleted after
> > + * ctrl->lock is released. Get a ref of both ctrl and admin_q
> > + * so they do not disappear until we are done with them.
> > + */
> > + WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
> > + nvme_get_ctrl(ctrl);
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > + sctrl = ctrl;
> > + break;
> > + }
> > + mutex_unlock(&nvme_subsystems_lock);
> > + return sctrl;
> > +}
> > +
> > +static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
> > +{
> > + atomic_inc(&sctrl->ccr_limit);
> > + blk_put_queue(sctrl->admin_q);
> > + nvme_put_ctrl(sctrl);
> > +}
> > +
> > +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl,
> > + unsigned long deadline)
> > +{
> > + struct nvme_ccr_entry ccr = { };
> > + union nvme_result res = { 0 };
> > + struct nvme_command c = { };
> > + unsigned long flags, now, tmo = 0;
> > + bool completed = false;
> > + int ret = 0;
> > + u32 result;
> > +
> > + init_completion(&ccr.complete);
> > + ccr.ictrl = ictrl;
> > +
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + list_add_tail(&ccr.list, &sctrl->ccr_list);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > +
> > + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> > + c.ccr.ciu = ictrl->ciu;
> > + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> > + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> > + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> > + NULL, 0, NVME_QID_ANY, 0);
> > + if (ret) {
> > + ret = -EIO;
> > + goto out;
> > + }
> > +
> > + result = le32_to_cpu(res.u32);
> > + if (result & 0x01) /* Immediate Reset Successful */
> > + goto out;
> > +
> > + now = jiffies;
> > + if (time_before(now, deadline))
> > + tmo = min_t(unsigned long,
> > + secs_to_jiffies(ictrl->kato), deadline - now);
> > +
> > + if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > + ret = -ETIMEDOUT;
> > + goto out;
> > + }
> > +
> > + completed = true;
> > +
> > +out:
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + list_del(&ccr.list);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > + if (completed) {
> > + if (ccr.ccrs == NVME_CCR_STATUS_SUCCESS)
> > + return 0;
> > + return -EREMOTEIO;
> > + }
> > + return ret;
> > +}
> > +
> > +int nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > +{
> > + unsigned long deadline, timeout;
> > + struct nvme_ctrl *sctrl;
> > + u32 min_cntlid = 0;
> > + int ret;
> > +
> > + timeout = nvme_fence_timeout_ms(ictrl);
> > + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > +
> > + deadline = jiffies + msecs_to_jiffies(timeout);
> > + while (time_is_after_jiffies(deadline)) {
> > + sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> > + if (!sctrl) {
> > + dev_dbg(ictrl->device,
> > + "failed to find source controller\n");
> > + return -EIO;
> > + }
> > +
> > + ret = nvme_issue_wait_ccr(sctrl, ictrl, deadline);
> > + if (!ret) {
> > + dev_info(ictrl->device, "CCR succeeded using %s\n",
> > + dev_name(sctrl->device));
> > + nvme_put_ctrl_ccr(sctrl);
> > + return 0;
> > + }
> > +
> > + min_cntlid = sctrl->cntlid + 1;
> > + nvme_put_ctrl_ccr(sctrl);
> > +
> > + if (ret == -EIO) /* CCR command failed */
> > + continue;
> > +
> > + /* CCR operation failed or timed out */
> > + return ret;
> > + }
> > +
> > + dev_info(ictrl->device, "CCR operation timeout\n");
> > + return -ETIMEDOUT;
> > +}
>
> Please restructure the loop.
> Having a comment 'CCR operation failed or timed out',
> returning a status, and then have a comment
> 'CCR operation timeout' _after_ the return is confusing.
I can change /* CCR operation failed or timed out */ to something like
/*
 * The source controller accepted the CCR command but the CCR operation
 * timed out or failed. Retrying on another path is unlikely to
 * succeed, so return an error.
 */
And change the log line "CCR operation timeout\n" outside the while
loop to "fencing timed out\n".
Will this help?
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare@suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich