Linux-NVME Archive on lore.kernel.org
From: Daniel Wagner <dwagner@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Daniel Wagner <wagi@kernel.org>, Christoph Hellwig <hch@lst.de>,
	 Keith Busch <kbusch@kernel.org>, Hannes Reinecke <hare@suse.de>,
	 John Meneghini <jmeneghi@redhat.com>,
	randyj@purestorage.com,
	 Mohamed Khalfella <mkhalfella@purestorage.com>,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout
Date: Wed, 16 Apr 2025 10:51:38 +0200
Message-ID: <12892dbf-9cdc-48c3-a67d-0c626d54100a@flourine.local>
In-Reply-To: <94ec166a-2025-4743-8e36-fd1e96c33d6e@grimberg.me>

On Mon, Apr 14, 2025 at 01:03:57AM +0300, Sagi Grimberg wrote:
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -239,6 +239,7 @@ static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl)
> >   	flush_work(&ctrl->reset_work);
> >   	nvme_stop_ctrl(ctrl);
> > +	nvme_flush_failover(ctrl);
> 
> This will terminate all pending failover commands - this is the intended
> behavior?

Yes, it will, but I don't think that is the right behavior. From the
feedback I have gathered so far, it seems the correct way (per the spec)
is to handle each command individually.

FWIW, we are shipping

  https://lore.kernel.org/linux-nvme/20230908100049.80809-3-hare@suse.de/

as a stop-gap solution for a while now, and our customer hasn't been
able to reproduce the ghost writes anymore. As far as I can tell, it
also fails over all pending commands.

> 
> >   	nvme_remove_namespaces(ctrl);
> >   	ctrl->ops->delete_ctrl(ctrl);
> >   	nvme_uninit_ctrl(ctrl);
> > @@ -1310,6 +1311,19 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
> >   	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
> >   }
> > +void nvme_schedule_failover(struct nvme_ctrl *ctrl)
> > +{
> > +	unsigned long delay;
> > +
> > +	if (ctrl->cqt)
> > +		delay = msecs_to_jiffies(ctrl->cqt);
> > +	else
> > +		delay = ctrl->kato * HZ;
> 
> This delay was there before?

No, this function returns the additional time the host should wait
before it starts to fail over. So this is the CQT value, which applies
after the KATO expires; this timeout is currently not handled in the
nvme subsystem.

> >   void nvme_failover_req(struct request *req)
> >   {
> >   	struct nvme_ns *ns = req->q->queuedata;
> > +	struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
> >   	u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
> >   	unsigned long flags;
> >   	struct bio *bio;
> > +	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
> >   	nvme_mpath_clear_current_path(ns);
> > @@ -121,9 +123,53 @@ void nvme_failover_req(struct request *req)
> >   	blk_steal_bios(&ns->head->requeue_list, req);
> >   	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> > -	nvme_req(req)->status = 0;
> > -	nvme_end_req(req);
> > -	kblockd_schedule_work(&ns->head->requeue_work);
> > +	spin_lock_irqsave(&ctrl->lock, flags);
> > +	list_add_tail(&req->queuelist, &ctrl->failover_list);
> > +	spin_unlock_irqrestore(&ctrl->lock, flags);
> 
> So these request - which we stripped their bios, are now placed
> on a failover queue?

Yes, that is the idea.

> > +
> > +	if (state == NVME_CTRL_DELETING) {
> > +		/*
> > +		 * Requests which fail over in the DELETING state were
> > +		 * canceled, and blk_mq_tagset_wait_completed_request will
> > +		 * block until they have been processed, but
> > +		 * nvme_failover_work is already stopped. Thus schedule
> > +		 * a failover here; it's still necessary to delay these
> > +		 * commands by CQT.
> > +		 */
> > +		nvme_schedule_failover(ctrl);
> 
> This condition is unclear. Shouldn't any request failing over do this?
> Something is unclear here.

Yes, the comment is not very clear, sorry about that. I observed that
blk_mq_tagset_wait_completed_request would block when the controller was
already in the DELETING state and requests were canceled, IIRC. The
commands would be queued on the failover queue but never processed by
nvme_failover_work.

> > +	nvme_schedule_failover(&ctrl->ctrl);
> >   	nvme_rdma_teardown_io_queues(ctrl, shutdown);
> >   	nvme_quiesce_admin_queue(&ctrl->ctrl);
> >   	nvme_disable_ctrl(&ctrl->ctrl, shutdown);
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index d0023bcfd8a79a193adf2807a24481c8c164a174..3a6c1d3febaf233996e4dcf684793327b5d1412f 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2345,6 +2345,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   	nvme_stop_keep_alive(ctrl);
> >   	flush_work(&ctrl->async_event_work);
> > +	nvme_schedule_failover(ctrl);
> >   	nvme_tcp_teardown_io_queues(ctrl, false);
> >   	/* unquiesce to fail fast pending requests */
> >   	nvme_unquiesce_io_queues(ctrl);
> > 
> 
> What is the reason for rdma and tcp not being identical?

Sorry, I can't remember. But I really want to start moving this code
into one place.


