From mboxrd@z Thu Jan  1 00:00:00 1970
From: james.smart@broadcom.com (James Smart)
Date: Mon, 5 Nov 2018 21:45:21 -0800
Subject: [RFC PATCH] nvme of: don't flush scan work inside reset context
In-Reply-To: <20181106011822.GB20193@ming.t460p>
References: <20181105115734.15515-1-ming.lei@redhat.com>
 <782deffe-eb72-3550-2d69-980604609fd5@broadcom.com>
 <20181106011822.GB20193@ming.t460p>
Message-ID: <dbfed2f9-4ad3-40ce-7c6d-0feae2b4128f@broadcom.com>


On 11/5/2018 5:18 PM, Ming Lei wrote:
>
>> I also believe Ming's patch isn't enough.? The issue for the transports are
>> several of the flush cases require an io to complete to finish flushing -
>> not just scan_work.?? nvme_mpath_stop() syncs with ctrl->ana_work, which may
>> be waiting for an ana log to complete.? fw_act_queues() may be in the midst
>> of polling the CSTS bit register or doing a fw_slot log read, or a
>> stop_ctrl() routine for the transport may be in the middle of a start_ctrl
>> call.
> These commands won't be retried since REQ_FAILFAST_DRIVER is set for any
> request allocated via nvme_alloc_request().
>
> So the above flush cases you mentioned should be easier to handle than IO from
> scan context. However, looks the patch we discussed before by moving
> nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl() is still needed for
> avoiding hang on flush in these commands.

Agree - nvme_stop_ctrl() has to move so the transport routines that 
stop/return the io can run.? Leaving it in the same place with an 
argument won't be enough.


>
>> Give the transport may be at a point where it's lost connectivity to the
>> controller (never really a case for pci unless you consider hot unplug),
>> before calling ctlr stop, the transport has to terminate/return outstanding
>> io and stop further io from being accepted (usually by marking the queue as
>> not connected).
> This patch doesn't change this point, all in-flight IO is still terminate/return.
> With or without this patch, all these IO will be retried too after controller is
> recovered.
>
> Or you mean all these IOs have to terminate/return before calling
> ctrl->ops->stop_ctrl().

Patch as is, that calls the flush routines pointed out, will deadlock as 
the io won't be terminated and returned by the transport. The flush 
routines, thus stop_ctrl, has to be called after the transport stops the 
queues and aborts/returns the ios. Retries after that will do what they do.

But the concern I have is - if things like ns_revalidate requires normal 
io to be completed to be "flushed", then we have an issue. The 
non-normal io, generated by the core layer and those issued by multipath 
will fast-fail. But the normal io, whether non-multipath or last-path 
with multipath, may not be failed as they will be continually requeued 
or wait for a new path. So I don't know how it exits this deadlock.

-- james