From mboxrd@z Thu Jan 1 00:00:00 1970 From: james.smart@broadcom.com (James Smart) Date: Mon, 5 Nov 2018 21:45:21 -0800 Subject: [RFC PATCH] nvme of: don't flush scan work inside reset context In-Reply-To: <20181106011822.GB20193@ming.t460p> References: <20181105115734.15515-1-ming.lei@redhat.com> <782deffe-eb72-3550-2d69-980604609fd5@broadcom.com> <20181106011822.GB20193@ming.t460p> Message-ID: On 11/5/2018 5:18 PM, Ming Lei wrote: > >> I also believe Ming's patch isn't enough.? The issue for the transports are >> several of the flush cases require an io to complete to finish flushing - >> not just scan_work.?? nvme_mpath_stop() syncs with ctrl->ana_work, which may >> be waiting for an ana log to complete.? fw_act_queues() may be in the midst >> of polling the CSTS bit register or doing a fw_slot log read, or a >> stop_ctrl() routine for the transport may be in the middle of a start_ctrl >> call. > These commands won't be retried since REQ_FAILFAST_DRIVER is set for any > request allocated via nvme_alloc_request(). > > So the above flush cases you mentioned should be easier to handle than IO from > scan context. However, looks the patch we discussed before by moving > nvme_stop_ctrl() after nvme_rdma_shutdown_ctrl() is still needed for > avoiding hang on flush in these commands. Agree - nvme_stop_ctrl() has to move so the transport routines that stop/return the io can run.? Leaving it in the same place with an argument won't be enough. > >> Give the transport may be at a point where it's lost connectivity to the >> controller (never really a case for pci unless you consider hot unplug), >> before calling ctlr stop, the transport has to terminate/return outstanding >> io and stop further io from being accepted (usually by marking the queue as >> not connected). > This patch doesn't change this point, all in-flight IO is still terminate/return. > With or without this patch, all these IO will be retried too after controller is > recovered. > > Or you mean all these IOs have to terminate/return before calling > ctrl->ops->stop_ctrl(). Patch as is, that calls the flush routines pointed out, will deadlock as the io won't be terminated and returned by the transport. The flush routines, thus stop_ctrl, has to be called after the transport stops the queues and aborts/returns the ios. Retries after that will do what they do. But the concern I have is - if things like ns_revalidate requires normal io to be completed to be "flushed", then we have an issue. The non-normal io, generated by the core layer and those issued by multipath will fast-fail. But the normal io, whether non-multipath or last-path with multipath, may not be failed as they will be continually requeued or wait for a new path. So I don't know how it exits this deadlock. -- james