Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery

public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed

From: Sagi Grimberg <sagi@grimberg.me>
To: Randy Jennings <randyj@purestorage.com>
Cc: Mohamed Khalfella <mkhalfella@purestorage.com>,
	Chaitanya Kulkarni <kch@nvidia.com>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Keith Busch <kbusch@kernel.org>,
	Aaron Dailey <adailey@purestorage.com>,
	John Meneghini <jmeneghi@redhat.com>,
	Hannes Reinecke <hare@suse.de>,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
Date: Sun, 4 Jan 2026 23:14:38 +0200	[thread overview]
Message-ID: <3fb01be3-e162-42fa-8d75-2e96a89b2004@grimberg.me> (raw)
In-Reply-To: <CAPpK+O0dmysf6HxQN68q-JzSm-YATXQ1ZUcz-D56-+4WE-Aj7Q@mail.gmail.com>



On 31/12/2025 2:04, Randy Jennings wrote:
>>> +
>>> +             if (!ret) {
>>> +                     dev_info(ictrl->device, "CCR succeeded using %s\n",
>>> +                              dev_name(sctrl->device));
>>> +                     blk_put_queue(sctrl->admin_q);
>>> +                     nvme_put_ctrl(sctrl);
>>> +                     return 0;
>>> +             }
>>> +
>>> +             /* Try another controller */
>>> +             min_cntlid = sctrl->cntlid + 1;
>> OK, I see why min_cntlid is used. That is very non-intuitive.
>>
>> I'm wandering if it will be simpler to take one-shot at ccr and
>> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
>> unable
>> to reset the ictrl in time, how would another ctrl do a better job here?
> There are many different kinds of failures we are dealing with here
> that result in a dropped connection (association).  It could be a problem
> with the specific link, or it could be that the node of an HA pair in the
> storage array went down.  In the case of a specific link problem, maybe
> only one of the connections is down and any controller would work.
> In the case of the node of an HA pair, roughly half of the connections
> are going down, and there is a race between the controllers which
> are detected down first.  There were some heuristics put into the
> spec about deciding which controller to use, but that is more code
> and a refinement that could come later (and they are still heuristics;
> they may not be helpful).
>
> Because CCR offers a significant win of shortening the recovery time
> substantially, it is worth retrying on the other controllers. This time
> affects when we can start retrying IO.  KATO is in seconds, and
> NVMEoF should have the capability of doing a significant amount of
> IOs in each of those seconds.

But it doesn't actually do I/O, it issues I/O and then wait for it to 
time out.

>
> Besides, the alternative is just to wait.  Might as well be actively trying
> to shorten that wait time.  Besides a small increase in code complexity,
> is there a downside to doing so?

Simplicity is very important when it comes to non-trivial code paths 
like error recovery.

next prev parent reply	other threads:[~2026-01-04 21:14 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-26  2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2025-12-16  1:35   ` Randy Jennings
2025-11-26  2:11 ` [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
2025-12-16  1:43   ` Randy Jennings
2025-11-26  2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
2025-12-16  3:01   ` Randy Jennings
2025-12-31 21:14     ` Mohamed Khalfella
2025-12-25 13:14   ` Sagi Grimberg
2025-12-25 17:33     ` Mohamed Khalfella
2025-12-27  9:39       ` Sagi Grimberg
2025-12-31 21:35         ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
2025-12-16  3:11   ` Randy Jennings
2025-11-26  2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
2025-12-16  3:31   ` Randy Jennings
2025-12-25 13:23   ` Sagi Grimberg
2025-12-25 18:13     ` Mohamed Khalfella
2025-12-27  9:48       ` Sagi Grimberg
2025-12-31 22:00         ` Mohamed Khalfella
2026-01-04 21:09           ` Sagi Grimberg
2026-01-07  2:58             ` Randy Jennings
2026-01-30 22:31             ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
2025-12-18 15:22   ` Randy Jennings
2025-12-31 22:26     ` Mohamed Khalfella
2026-01-02 19:06       ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
2025-12-18 23:18   ` Randy Jennings
2025-12-19  1:39     ` Randy Jennings
2025-12-25 13:29   ` Sagi Grimberg
2025-12-25 17:17     ` Mohamed Khalfella
2025-12-27  9:52       ` Sagi Grimberg
2025-12-31 22:45         ` Mohamed Khalfella
2025-12-27  9:55       ` Sagi Grimberg
2025-12-31 22:36         ` Mohamed Khalfella
2025-12-31 23:04           ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2025-12-19  1:21   ` Randy Jennings
2025-12-27 10:14   ` Sagi Grimberg
2025-12-31  0:04     ` Randy Jennings
2026-01-04 21:14       ` Sagi Grimberg [this message]
2026-01-07  3:16         ` Randy Jennings
2025-12-31 23:43     ` Mohamed Khalfella
2026-01-04 21:39       ` Sagi Grimberg
2026-01-30 22:01         ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
2025-12-19  1:31   ` Randy Jennings
2025-12-27 10:24   ` Sagi Grimberg
2025-12-31 23:51     ` Mohamed Khalfella
2026-01-04 21:15       ` Sagi Grimberg
2026-01-30 22:32         ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-12-19  2:06   ` Randy Jennings
2026-01-01  0:04     ` Mohamed Khalfella
2025-12-27 10:35   ` Sagi Grimberg
2025-12-31  0:13     ` Randy Jennings
2026-01-04 21:19       ` Sagi Grimberg
2026-01-01  0:27     ` Mohamed Khalfella
2025-11-26  2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
2025-12-19  2:16   ` Randy Jennings
2025-12-27 10:36   ` Sagi Grimberg
2025-11-26  2:11 ` [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
2025-12-19  2:59   ` Randy Jennings
2025-11-26  2:12 ` [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-12-20  1:21   ` Randy Jennings
2025-11-26  2:12 ` [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state Mohamed Khalfella
2025-12-20  1:44   ` Randy Jennings

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3fb01be3-e162-42fa-8d75-2e96a89b2004@grimberg.me \
    --to=sagi@grimberg.me \
    --cc=adailey@purestorage.com \
    --cc=axboe@kernel.dk \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=jmeneghi@redhat.com \
    --cc=kbusch@kernel.org \
    --cc=kch@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=mkhalfella@purestorage.com \
    --cc=randyj@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox