From mboxrd@z Thu Jan  1 00:00:00 1970
From: ming.lei@redhat.com (Ming Lei)
Date: Fri, 21 Jun 2019 14:58:52 +0800
Subject: [PATCH 2/2] nvme: flush scan_work when resetting controller
In-Reply-To: <3dbb8dc0-2491-6226-8715-b0f5b7f6a73a@suse.de>
References: <20190618101025.78840-1-hare@suse.de>
 <20190618101025.78840-3-hare@suse.de>
 <20190620013650.GB31179@ming.t460p>
 <3dbb8dc0-2491-6226-8715-b0f5b7f6a73a@suse.de>
Message-ID: <20190621065851.GA22145@ming.t460p>

On Fri, Jun 21, 2019@08:14:45AM +0200, Hannes Reinecke wrote:
> On 6/20/19 3:36 AM, Ming Lei wrote:
> > On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
> >> When resetting the controller there is no point whatsoever to
> >> have a scan run in parallel; we cannot access the controller and
> > 
> > scan won't be run in parallel, because .scan_work is embedded in
> > 'struct nvme_ctrl' which is per-HBA.
> > 
> Wrong. We do.
> Not sure why having it embedded in the controller structure might
> prevent this from happening; both reset and scan are embedded, but
> running on different queues:

I mean the scan_work function itself is run exclusively, but yes, it can be 
run when resetting is in-progress.

> 
> void nvme_queue_scan(struct nvme_ctrl *ctrl)
> {
> 	/*
> 	 * Only new queue scan work when admin and IO queues are both alive
> 	 */
> 	if (ctrl->state == NVME_CTRL_LIVE)
> 		queue_work(nvme_wq, &ctrl->scan_work);
> }
> EXPORT_SYMBOL_GPL(nvme_queue_scan);
> 
> int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
> {
> 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> 		return -EBUSY;
> 	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
> 		return -EBUSY;
> 	return 0;
> }
> EXPORT_SYMBOL_GPL(nvme_reset_ctrl);
> 
> So there's nothing stopping them to run in parallel.
> 
> >> we cannot tell which devices are present and which not.
> >> Additionally we'll run a scan after reset anyway.
> >> So flush existing scans before reconnecting, ensuring to
> >> short-circuit the scan workqueue function if the controller state
> >> isn't live to avoid lockups.
> > 
> > This way may cause dead-lock.
> > 
> > 1) nvme_revalidate_disk() might freeze queue in flush context, however
> > any in-flight requests won't be completed until reset is done, so
> > deadlock may be caused by flushing scans in reset context.
> > 
> Which is why I'm checking the controller state; I've observed the
> deadlock plenty of times before introducing the controller state check.

Your check can't help wrt. the deadlock, for example:

1) in scan work context:

- blk_mq_freeze_queue() is being started after passing the controller
  state check

2) timeout & reset is triggered in another context at the exact same time:

- all in-flight IOs won't be freed until disable controller & reset is done.

- now flush_work() in reset context can't move on, because
  blk_mq_freeze_queue() in scan context can't make progress.

> 
> > 2) sync IO may be involved in revalidate_disk() which is called in
> > scan context, so deadlock is caused for same reason with 1).
> > 
> I/O during revalidate_disk() is protected by the state check, too, so we
> won't be issuing any I/O during resetting.
> 
> To be precise: any I/O in flight when reset is triggered will be
> terminated, and any subsequent I/O is short-circuited by the state check.

No, any I/O in flight before resetting is just terminated from hardware,
but still in blk-mq sw or scheduler queue, so either sync IO or queue
freezing won't make progress.

Please see nvme_complete_rq(), all these IO will be retried usually.


Thanks,
Ming