* [PATCH 0/2] blk-mq/nvme: cancel request synchronously
@ 2019-03-18 3:29 Ming Lei
2019-03-18 3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme
Hi,
This patchset introduces blk_mq_complete_request_sync() for canceling
requests synchronously in error handler context, so that a race between
completing a request and destroying the controller/queues can be fixed.
Ming Lei (2):
blk-mq: introduce blk_mq_complete_request_sync()
nvme: cancel request synchronously
block/blk-mq.c | 20 ++++++++++++++++----
drivers/nvme/host/core.c | 2 +-
include/linux/blk-mq.h | 1 +
3 files changed, 18 insertions(+), 5 deletions(-)
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
--
2.9.5
^ permalink raw reply [flat|nested] 29+ messages in thread

* [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
2019-03-18 3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
@ 2019-03-18 3:29 ` Ming Lei
2019-03-18 4:09 ` Bart Van Assche
` (2 more replies)
2019-03-18 3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
2019-03-27 2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
2 siblings, 3 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme

NVMe's error handler follows the typical steps for tearing down hardware:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
   blk_mq_tagset_busy_iter(tags, cancel_request, ...)
   cancel_request():
   	mark the request as abort
   	blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be a race between #3 and #4, because
blk_mq_complete_request() actually completes the request asynchronously.

This patch introduces blk_mq_complete_request_sync() for fixing the
above race.

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 20 ++++++++++++++++----
 include/linux/blk-mq.h |  1 +
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9c181603cbd..8f925ac0b26d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -569,7 +569,7 @@ static void __blk_mq_complete_request_remote(void *data)
 	q->mq_ops->complete(rq);
 }
 
-static void __blk_mq_complete_request(struct request *rq)
+static void __blk_mq_complete_request(struct request *rq, bool sync)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
@@ -586,7 +586,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	 * So complete IO reqeust in softirq context in case of single queue
 	 * for not degrading IO performance by irqsoff latency.
 	 */
-	if (q->nr_hw_queues == 1) {
+	if (q->nr_hw_queues == 1 && !sync) {
 		__blk_complete_request(rq);
 		return;
 	}
@@ -594,8 +594,11 @@ static void __blk_mq_complete_request(struct request *rq)
 	/*
 	 * For a polled request, always complete locallly, it's pointless
 	 * to redirect the completion.
+	 *
+	 * If driver requires to complete the request synchronously,
+	 * complete it locally, and it is usually done in error handler.
 	 */
-	if ((rq->cmd_flags & REQ_HIPRI) ||
+	if ((rq->cmd_flags & REQ_HIPRI) || sync ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) {
 		q->mq_ops->complete(rq);
 		return;
@@ -648,11 +651,20 @@ bool blk_mq_complete_request(struct request *rq)
 {
 	if (unlikely(blk_should_fake_timeout(rq->q)))
 		return false;
-	__blk_mq_complete_request(rq);
+	__blk_mq_complete_request(rq, false);
 	return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
+bool blk_mq_complete_request_sync(struct request *rq)
+{
+	if (unlikely(blk_should_fake_timeout(rq->q)))
+		return false;
+	__blk_mq_complete_request(rq, true);
+	return true;
+}
+EXPORT_SYMBOL(blk_mq_complete_request_sync);
+
 int blk_mq_request_started(struct request *rq)
 {
 	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b0c814bcc7e3..6a514e5136f4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -305,6 +305,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 bool blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
 		struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.9.5

^ permalink raw reply related [flat|nested] 29+ messages in thread
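The race between steps #3 and #4 can be sketched in isolation. The following is a hypothetical userspace C model (these names and structures are invented for illustration and exist nowhere in the kernel); it only mimics the distinction between a completion that is parked for another context and one that runs before the cancel path returns:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Invented miniature model of the two completion paths.  In the real
 * code, blk_mq_complete_request() may defer ->complete() to a softirq
 * or remote CPU, so teardown (step #4) can run first; the _sync
 * variant runs ->complete() before returning.
 */
struct request {
	bool completed;
	struct request *next;
};

struct queue {
	struct request *deferred;  /* completions parked for another context */
	bool hw_alive;             /* stands in for the real hw queues */
};

/* Async path: only parks the completion for a later context. */
static void complete_async(struct queue *q, struct request *rq)
{
	rq->next = q->deferred;
	q->deferred = rq;
}

/* Sync path: the completion has run by the time this returns. */
static void complete_sync(struct queue *q, struct request *rq)
{
	(void)q;              /* real code would run q->mq_ops->complete(rq) */
	rq->completed = true;
}

/*
 * Later context: run parked completions; count any that ran after the
 * hw queues were torn down (the race the patch fixes).
 */
static int flush_deferred(struct queue *q)
{
	int after_teardown = 0;
	struct request *rq;

	for (rq = q->deferred; rq; rq = rq->next) {
		if (!q->hw_alive)
			after_teardown++;
		rq->completed = true;
	}
	q->deferred = NULL;
	return after_teardown;
}
```

With complete_async(), tearing down the hw resources before the parked completion runs reproduces the race; with complete_sync(), the request is already completed when step #4 starts, so nothing runs against destroyed queues.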
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei @ 2019-03-18 4:09 ` Bart Van Assche 2019-03-18 7:38 ` Ming Lei ` (2 more replies) 2019-03-18 17:37 ` James Smart 2019-03-27 8:30 ` Christoph Hellwig 2 siblings, 3 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 4:09 UTC (permalink / raw) To: Ming Lei, Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-nvme On 3/17/19 8:29 PM, Ming Lei wrote: > In NVMe's error handler, follows the typical steps for tearing down > hardware: > > 1) stop blk_mq hw queues > 2) stop the real hw queues > 3) cancel in-flight requests via > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > cancel_request(): > mark the request as abort > blk_mq_complete_request(req); > 4) destroy real hw queues > > However, there may be race between #3 and #4, because blk_mq_complete_request() > actually completes the request asynchronously. > > This patch introduces blk_mq_complete_request_sync() for fixing the > above race. Other block drivers wait until outstanding requests have completed by calling blk_cleanup_queue() before hardware queues are destroyed. Why can't the NVMe driver follow that approach? Thanks, Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 4:09 ` Bart Van Assche @ 2019-03-18 7:38 ` Ming Lei 2019-03-18 15:04 ` Bart Van Assche 2019-03-21 2:13 ` Sagi Grimberg 2019-03-18 14:40 ` Keith Busch 2019-03-18 17:30 ` James Smart 2 siblings, 2 replies; 29+ messages in thread From: Ming Lei @ 2019-03-18 7:38 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > On 3/17/19 8:29 PM, Ming Lei wrote: > > In NVMe's error handler, follows the typical steps for tearing down > > hardware: > > > > 1) stop blk_mq hw queues > > 2) stop the real hw queues > > 3) cancel in-flight requests via > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > cancel_request(): > > mark the request as abort > > blk_mq_complete_request(req); > > 4) destroy real hw queues > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > actually completes the request asynchronously. > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > above race. > > Other block drivers wait until outstanding requests have completed by > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > the NVMe driver follow that approach? The tearing down of controller can be done in error handler, in which the request queues may not be cleaned up, almost all kinds of NVMe controller's error handling follows the above steps, such as: nvme_rdma_error_recovery_work() ->nvme_rdma_teardown_io_queues() nvme_timeout() ->nvme_dev_disable Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 7:38 ` Ming Lei @ 2019-03-18 15:04 ` Bart Van Assche 2019-03-18 15:16 ` Ming Lei 2019-03-21 2:13 ` Sagi Grimberg 1 sibling, 1 reply; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 15:04 UTC (permalink / raw) To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote: > On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > > On 3/17/19 8:29 PM, Ming Lei wrote: > > > In NVMe's error handler, follows the typical steps for tearing down > > > hardware: > > > > > > 1) stop blk_mq hw queues > > > 2) stop the real hw queues > > > 3) cancel in-flight requests via > > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > > cancel_request(): > > > mark the request as abort > > > blk_mq_complete_request(req); > > > 4) destroy real hw queues > > > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > > actually completes the request asynchronously. > > > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > > above race. > > > > Other block drivers wait until outstanding requests have completed by > > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > > the NVMe driver follow that approach? > > The tearing down of controller can be done in error handler, in which > the request queues may not be cleaned up, almost all kinds of NVMe > controller's error handling follows the above steps, such as: > > nvme_rdma_error_recovery_work() > ->nvme_rdma_teardown_io_queues() > > nvme_timeout() > ->nvme_dev_disable Hi Ming, This makes me wonder whether the current design of the NVMe core is the best design we can come up with? The structure of e.g. the SRP initiator and target drivers is similar to the NVMeOF drivers. However, there is no need in the SRP initiator driver to terminate requests synchronously. 
Is this due to differences in the error handling approaches in the SCSI and NVMe core drivers? Thanks, Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:04 ` Bart Van Assche @ 2019-03-18 15:16 ` Ming Lei 2019-03-18 15:49 ` Bart Van Assche 0 siblings, 1 reply; 29+ messages in thread From: Ming Lei @ 2019-03-18 15:16 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, Mar 18, 2019 at 08:04:55AM -0700, Bart Van Assche wrote: > On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote: > > On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > > > On 3/17/19 8:29 PM, Ming Lei wrote: > > > > In NVMe's error handler, follows the typical steps for tearing down > > > > hardware: > > > > > > > > 1) stop blk_mq hw queues > > > > 2) stop the real hw queues > > > > 3) cancel in-flight requests via > > > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > > > cancel_request(): > > > > mark the request as abort > > > > blk_mq_complete_request(req); > > > > 4) destroy real hw queues > > > > > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > > > actually completes the request asynchronously. > > > > > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > > > above race. > > > > > > Other block drivers wait until outstanding requests have completed by > > > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > > > the NVMe driver follow that approach? > > > > The tearing down of controller can be done in error handler, in which > > the request queues may not be cleaned up, almost all kinds of NVMe > > controller's error handling follows the above steps, such as: > > > > nvme_rdma_error_recovery_work() > > ->nvme_rdma_teardown_io_queues() > > > > nvme_timeout() > > ->nvme_dev_disable > > Hi Ming, > > This makes me wonder whether the current design of the NVMe core is the best > design we can come up with? The structure of e.g. the SRP initiator and target > drivers is similar to the NVMeOF drivers. 
However, there is no need in the SRP > initiator driver to terminate requests synchronously. Is this due to I am not familiar with SRP, could you explain what SRP initiator driver will do when the controller is in bad state? Especially about dealing with in-flight IO requests under this situation. > differences in the error handling approaches in the SCSI and NVMe core drivers? As far as I can tell, I don't see obvious design issue in NVMe host drivers, which tries best to recover controller and retries to complete all in-flight IO. Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:16 ` Ming Lei @ 2019-03-18 15:49 ` Bart Van Assche 2019-03-18 16:06 ` Ming Lei 2019-03-21 0:47 ` Sagi Grimberg 0 siblings, 2 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 15:49 UTC (permalink / raw) To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote: > I am not familiar with SRP, could you explain what SRP initiator driver > will do when the controller is in bad state? Especially about dealing with > in-flight IO requests under this situation. Hi Ming, Just like the NVMeOF initiator driver, the SRP initiator driver uses an RDMA RC connection for all of its communication over the network. If communication between initiator and target fails the target driver will close the connection or one of the work requests that was posted by the initiator driver will complete with an error status (wc->status != IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will try to reestablish the connection between initiator and target after a certain delay: if (delay > 0) queue_delayed_work(system_long_wq, &rport->reconnect_work, 1UL * delay * HZ); SCSI timeouts may kick the SCSI error handler. That results in calls of the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() terminates all outstanding requests after having disconnected the RDMA RC connection. Disconnecting the RC connection first guarantees that there are no concurrent request completion calls from the regular completion path and from the error handler. Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:49 ` Bart Van Assche @ 2019-03-18 16:06 ` Ming Lei 2019-03-21 0:47 ` Sagi Grimberg 1 sibling, 0 replies; 29+ messages in thread From: Ming Lei @ 2019-03-18 16:06 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, Mar 18, 2019 at 08:49:24AM -0700, Bart Van Assche wrote: > On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote: > > I am not familiar with SRP, could you explain what SRP initiator driver > > will do when the controller is in bad state? Especially about dealing with > > in-flight IO requests under this situation. > > Hi Ming, > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > RDMA RC connection for all of its communication over the network. If > communication between initiator and target fails the target driver will > close the connection or one of the work requests that was posted by the > initiator driver will complete with an error status (wc->status != > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > try to reestablish the connection between initiator and target after a > certain delay: > > if (delay > 0) > queue_delayed_work(system_long_wq, &rport->reconnect_work, > 1UL * delay * HZ); > > SCSI timeouts may kick the SCSI error handler. That results in calls of > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > terminates all outstanding requests after having disconnected the RDMA RC > connection. Looks the approach of NVMe's error handler is basically similar with above. And NVMe just uses 'blk_mq_tagset_busy_iter(nvme_cancel_request)' to abort in-flight requests, and I guess SCSI FC may use driver's approach to do that. > Disconnecting the RC connection first guarantees that there > are no concurrent request completion calls from the regular completion > path and from the error handler. 
It looks like guaranteeing no concurrent request completion requires a driver-specific implementation. However, this patch provides one simple approach for NVMe, so no driver-specific sync mechanism is needed.

Thanks,
Ming

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:49 ` Bart Van Assche 2019-03-18 16:06 ` Ming Lei @ 2019-03-21 0:47 ` Sagi Grimberg 2019-03-21 1:39 ` Ming Lei 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 2 replies; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 0:47 UTC (permalink / raw) To: Bart Van Assche, Ming Lei Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme > Hi Ming, > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > RDMA RC connection for all of its communication over the network. If > communication between initiator and target fails the target driver will > close the connection or one of the work requests that was posted by the > initiator driver will complete with an error status (wc->status != > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > try to reestablish the connection between initiator and target after a > certain delay: > > if (delay > 0) > queue_delayed_work(system_long_wq, &rport->reconnect_work, > 1UL * delay * HZ); > > SCSI timeouts may kick the SCSI error handler. That results in calls of > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > terminates all outstanding requests after having disconnected the RDMA RC > connection. Disconnecting the RC connection first guarantees that there > are no concurrent request completion calls from the regular completion > path and from the error handler. Hi Bart, If I understand the race correctly, its not between the requests completion and the queue pairs removal nor the timeout handler necessarily, but rather it is between the async requests completion and the tagset deallocation. Think of surprise removal (or disconnect) during I/O, drivers usually stop/quiesce/freeze the queues, terminate/abort inflight I/Os and then teardown the hw queues and the tagset. IIRC, the same race holds for srp if this happens during I/O: 1. 
srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> __rport_fail_io_fast() 2. complete all I/Os (async remotely via smp) Then continue.. 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() What is preventing (3) from happening before (2) if its async? I would think that scsi drivers need the exact same thing... ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 0:47 ` Sagi Grimberg @ 2019-03-21 1:39 ` Ming Lei 2019-03-21 2:04 ` Sagi Grimberg 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 1 reply; 29+ messages in thread From: Ming Lei @ 2019-03-21 1:39 UTC (permalink / raw) To: Sagi Grimberg Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Wed, Mar 20, 2019 at 05:47:01PM -0700, Sagi Grimberg wrote: > > > Hi Ming, > > > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > > RDMA RC connection for all of its communication over the network. If > > communication between initiator and target fails the target driver will > > close the connection or one of the work requests that was posted by the > > initiator driver will complete with an error status (wc->status != > > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > > try to reestablish the connection between initiator and target after a > > certain delay: > > > > if (delay > 0) > > queue_delayed_work(system_long_wq, &rport->reconnect_work, > > 1UL * delay * HZ); > > > > SCSI timeouts may kick the SCSI error handler. That results in calls of > > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > > terminates all outstanding requests after having disconnected the RDMA RC > > connection. Disconnecting the RC connection first guarantees that there > > are no concurrent request completion calls from the regular completion > > path and from the error handler. > > Hi Bart, > > If I understand the race correctly, its not between the requests > completion and the queue pairs removal nor the timeout handler > necessarily, but rather it is between the async requests completion and > the tagset deallocation. > > Think of surprise removal (or disconnect) during I/O, drivers > usually stop/quiesce/freeze the queues, terminate/abort inflight > I/Os and then teardown the hw queues and the tagset. 
> > IIRC, the same race holds for srp if this happens during I/O: > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> > __rport_fail_io_fast() > > 2. complete all I/Os (async remotely via smp) > > Then continue.. > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() > > What is preventing (3) from happening before (2) if its async? I would > think that scsi drivers need the exact same thing... blk_cleanup_queue() will do that, but it can't be used in device recovery obviously. BTW, blk_mq_complete_request_sync() is a bit misleading, maybe blk_mq_complete_request_locally() is better. Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 1:39 ` Ming Lei @ 2019-03-21 2:04 ` Sagi Grimberg 2019-03-21 2:32 ` Ming Lei 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 2:04 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig >> Hi Bart, >> >> If I understand the race correctly, its not between the requests >> completion and the queue pairs removal nor the timeout handler >> necessarily, but rather it is between the async requests completion and >> the tagset deallocation. >> >> Think of surprise removal (or disconnect) during I/O, drivers >> usually stop/quiesce/freeze the queues, terminate/abort inflight >> I/Os and then teardown the hw queues and the tagset. >> >> IIRC, the same race holds for srp if this happens during I/O: >> 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> >> __rport_fail_io_fast() >> >> 2. complete all I/Os (async remotely via smp) >> >> Then continue.. >> >> 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() >> >> What is preventing (3) from happening before (2) if its async? I would >> think that scsi drivers need the exact same thing... > > blk_cleanup_queue() will do that, but it can't be used in device recovery > obviously. But in device recovery we never free the tagset... I might be missing the race here then... > BTW, blk_mq_complete_request_sync() is a bit misleading, maybe > blk_mq_complete_request_locally() is better. Not really... ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
2019-03-21 2:04 ` Sagi Grimberg
@ 2019-03-21 2:32 ` Ming Lei
2019-03-21 21:40 ` Sagi Grimberg
0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-21 2:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig

On Wed, Mar 20, 2019 at 07:04:09PM -0700, Sagi Grimberg wrote:
> > > Hi Bart,
> > >
> > > If I understand the race correctly, its not between the requests
> > > completion and the queue pairs removal nor the timeout handler
> > > necessarily, but rather it is between the async requests completion and
> > > the tagset deallocation.
> > >
> > > Think of surprise removal (or disconnect) during I/O, drivers
> > > usually stop/quiesce/freeze the queues, terminate/abort inflight
> > > I/Os and then teardown the hw queues and the tagset.
> > >
> > > IIRC, the same race holds for srp if this happens during I/O:
> > > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() ->
> > > __rport_fail_io_fast()
> > >
> > > 2. complete all I/Os (async remotely via smp)
> > >
> > > Then continue..
> > >
> > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
> > >
> > > What is preventing (3) from happening before (2) if its async? I would
> > > think that scsi drivers need the exact same thing...
> >
> > blk_cleanup_queue() will do that, but it can't be used in device recovery
> > obviously.
>
> But in device recovery we never free the tagset... I might be missing
> the race here then...

For example:

nvme_rdma_complete_rq
	->nvme_rdma_unmap_data
		->ib_mr_pool_put

But the ib queue pair may have been destroyed by nvme_rdma_destroy_io_queues()
before the request's remote completion:

nvme_rdma_teardown_io_queues:
	nvme_stop_queues(&ctrl->ctrl);
	nvme_rdma_stop_io_queues(ctrl);
	blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
			&ctrl->ctrl);
	if (remove)
		nvme_start_queues(&ctrl->ctrl);
	nvme_rdma_destroy_io_queues(ctrl, remove);

> > BTW, blk_mq_complete_request_sync() is a bit misleading, maybe
> > blk_mq_complete_request_locally() is better.
>
> Not really...

Naming is always the hard part...

Thanks,
Ming

^ permalink raw reply [flat|nested] 29+ messages in thread
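The concrete nvme_rdma case above can be modeled the same way: treat ib_mr_pool_put() as "return an MR to a pool" and nvme_rdma_destroy_io_queues() as "free the pool". Below is a hypothetical userspace sketch (all names invented; it is not kernel code) showing why a cancel that merely schedules the completion leaves a use-after-free window, while a synchronous cancel does not:

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Invented model: the completion body returns an MR to a pool (as
 * nvme_rdma_complete_rq eventually does via ib_mr_pool_put), and
 * destroy_io_queues() frees that pool.
 */
struct mr_pool {
	int free_mrs;
};

struct rdma_queue {
	struct mr_pool *pool;     /* freed by destroy_io_queues() */
	bool completion_pending;  /* a scheduled, not-yet-run completion */
};

/* Completion body: hands the request's MR back to the pool. */
static bool run_completion(struct rdma_queue *q)
{
	if (!q->pool)
		return false;     /* would be a use-after-free in real code */
	q->pool->free_mrs++;
	q->completion_pending = false;
	return true;
}

/* Cancel via an async completion: only schedules the work. */
static void cancel_async(struct rdma_queue *q)
{
	q->completion_pending = true;
}

/* Cancel via a sync completion: the completion has run on return. */
static bool cancel_sync(struct rdma_queue *q)
{
	return run_completion(q);
}

static void destroy_io_queues(struct rdma_queue *q)
{
	free(q->pool);
	q->pool = NULL;
}
```

In the teardown ordering quoted above, blk_mq_tagset_busy_iter() plays the role of the cancel step and nvme_rdma_destroy_io_queues() the role of destroy_io_queues(): if the cancel only schedules the completion, destroy can run first.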
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 2:32 ` Ming Lei @ 2019-03-21 21:40 ` Sagi Grimberg 2019-03-27 8:27 ` Christoph Hellwig 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 21:40 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig > For example, > > nvme_rdma_complete_rq > ->nvme_rdma_unmap_data > ->ib_mr_pool_put > > But the ib queue pair may has been destroyed by nvme_rdma_destroy_io_queues() > before request's remote completion. > > nvme_rdma_teardown_io_queues: > nvme_stop_queues(&ctrl->ctrl); > nvme_rdma_stop_io_queues(ctrl); > blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, > &ctrl->ctrl); > if (remove) > nvme_start_queues(&ctrl->ctrl); > nvme_rdma_destroy_io_queues(ctrl, remove); > Yea, you're right. I'm fine with this patchset. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 21:40 ` Sagi Grimberg @ 2019-03-27 8:27 ` Christoph Hellwig 0 siblings, 0 replies; 29+ messages in thread From: Christoph Hellwig @ 2019-03-27 8:27 UTC (permalink / raw) To: Sagi Grimberg Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig On Thu, Mar 21, 2019 at 02:40:51PM -0700, Sagi Grimberg wrote: >> nvme_rdma_teardown_io_queues: >> nvme_stop_queues(&ctrl->ctrl); >> nvme_rdma_stop_io_queues(ctrl); >> blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, >> &ctrl->ctrl); >> if (remove) >> nvme_start_queues(&ctrl->ctrl); >> nvme_rdma_destroy_io_queues(ctrl, remove); >> > > Yea, you're right. I'm fine with this patchset. Is this an official Reviewed-by? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 0:47 ` Sagi Grimberg 2019-03-21 1:39 ` Ming Lei @ 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 0 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-21 2:15 UTC (permalink / raw) To: Sagi Grimberg, Ming Lei Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On 3/20/19 5:47 PM, Sagi Grimberg wrote: > If I understand the race correctly, its not between the requests > completion and the queue pairs removal nor the timeout handler > necessarily, but rather it is between the async requests completion and > the tagset deallocation. > > Think of surprise removal (or disconnect) during I/O, drivers > usually stop/quiesce/freeze the queues, terminate/abort inflight > I/Os and then teardown the hw queues and the tagset. > > IIRC, the same race holds for srp if this happens during I/O: > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() > -> __rport_fail_io_fast() > > 2. complete all I/Os (async remotely via smp) > > Then continue.. > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() > > What is preventing (3) from happening before (2) if its async? I would > think that scsi drivers need the exact same thing... Hi Sagi, As Ming already replied, I don't think that (3) can happen before (2) in case of the SRP driver. If you have a look at srp_remove_target() you will see that it calls scsi_remove_host(). That function only returns after blk_cleanup_queue() has been called for all associated request queues. As you know that function waits until all outstanding requests have completed. Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 7:38 ` Ming Lei 2019-03-18 15:04 ` Bart Van Assche @ 2019-03-21 2:13 ` Sagi Grimberg 1 sibling, 0 replies; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 2:13 UTC (permalink / raw) To: Ming Lei, Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme >>> In NVMe's error handler, follows the typical steps for tearing down >>> hardware: >>> >>> 1) stop blk_mq hw queues >>> 2) stop the real hw queues >>> 3) cancel in-flight requests via >>> blk_mq_tagset_busy_iter(tags, cancel_request, ...) >>> cancel_request(): >>> mark the request as abort >>> blk_mq_complete_request(req); >>> 4) destroy real hw queues >>> >>> However, there may be race between #3 and #4, because blk_mq_complete_request() >>> actually completes the request asynchronously. >>> >>> This patch introduces blk_mq_complete_request_sync() for fixing the >>> above race. >> >> Other block drivers wait until outstanding requests have completed by >> calling blk_cleanup_queue() before hardware queues are destroyed. Why can't >> the NVMe driver follow that approach? > > The tearing down of controller can be done in error handler, in which > the request queues may not be cleaned up, almost all kinds of NVMe > controller's error handling follows the above steps, such as: > > nvme_rdma_error_recovery_work() > ->nvme_rdma_teardown_io_queues() Clarification, this happens in its dedicated context, not the timeout or error handler. But I still don't understand the issue here, what is the exact race you are referring to? that we abort/cancel a request and then we complete it again when we destroy the HW queue? If so, that is not the case (at least for rdma/tcp) because nvme_rdma_teardown_io_queues() first flushes the hw queue and then aborts inflight I/O. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 4:09 ` Bart Van Assche 2019-03-18 7:38 ` Ming Lei @ 2019-03-18 14:40 ` Keith Busch 2019-03-18 17:30 ` James Smart 2 siblings, 0 replies; 29+ messages in thread From: Keith Busch @ 2019-03-18 14:40 UTC (permalink / raw) To: Bart Van Assche Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > On 3/17/19 8:29 PM, Ming Lei wrote: > > In NVMe's error handler, follows the typical steps for tearing down > > hardware: > > > > 1) stop blk_mq hw queues > > 2) stop the real hw queues > > 3) cancel in-flight requests via > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > cancel_request(): > > mark the request as abort > > blk_mq_complete_request(req); > > 4) destroy real hw queues > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > actually completes the request asynchronously. > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > above race. > > Other block drivers wait until outstanding requests have completed by > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > the NVMe driver follow that approach? You can't just wait for an outstanding request indefinitely. We have to safely make forward progress when we've determined it's not going to be completed. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  4:09 ` Bart Van Assche
  2019-03-18  7:38 ` Ming Lei
  2019-03-18 14:40 ` Keith Busch
@ 2019-03-18 17:30 ` James Smart
  2 siblings, 0 replies; 29+ messages in thread
From: James Smart @ 2019-03-18 17:30 UTC (permalink / raw)
To: Bart Van Assche, Ming Lei, Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On 3/17/2019 9:09 PM, Bart Van Assche wrote:
> On 3/17/19 8:29 PM, Ming Lei wrote:
>> In NVMe's error handler, follows the typical steps for tearing down
>> hardware:
>>
>> 1) stop blk_mq hw queues
>> 2) stop the real hw queues
>> 3) cancel in-flight requests via
>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>> 	cancel_request():
>> 		mark the request as abort
>> 		blk_mq_complete_request(req);
>> 4) destroy real hw queues
>>
>> However, there may be race between #3 and #4, because
>> blk_mq_complete_request()
>> actually completes the request asynchronously.
>>
>> This patch introduces blk_mq_complete_request_sync() for fixing the
>> above race.
>
> Other block drivers wait until outstanding requests have completed by
> calling blk_cleanup_queue() before hardware queues are destroyed. Why
> can't the NVMe driver follow that approach?
>

Speaking for the fabrics, not necessarily pci:

The intent of this looping, which happens immediately following an error
being detected, is to cause the termination of the outstanding requests.
Otherwise, the only recourse is to wait for the ios to finish, which they
may never do, or to have their upper-level timeout expire to cause their
termination - thus a very long delay. And one of the commands, on the
admin queue - a different tag set but handled the same - doesn't have a
timeout (the Async Event Reporting command), so it wouldn't necessarily
clear without this looping.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  4:09 ` Bart Van Assche
@ 2019-03-18 17:37 ` James Smart
  2019-03-19  1:06 ` Ming Lei
  2019-03-19  1:31 ` Ming Lei
  2019-03-27  8:30 ` Christoph Hellwig
  2 siblings, 2 replies; 29+ messages in thread
From: James Smart @ 2019-03-18 17:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On 3/17/2019 8:29 PM, Ming Lei wrote:
> In NVMe's error handler, follows the typical steps for tearing down
> hardware:
>
> 1) stop blk_mq hw queues
> 2) stop the real hw queues
> 3) cancel in-flight requests via
> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> 	cancel_request():
> 		mark the request as abort
> 		blk_mq_complete_request(req);
> 4) destroy real hw queues
>
> However, there may be race between #3 and #4, because blk_mq_complete_request()
> actually completes the request asynchronously.
>
> This patch introduces blk_mq_complete_request_sync() for fixing the
> above race.
>

This won't help FC at all. Inherently, the "completion" has to be
asynchronous as line traffic may be required.

e.g. FC doesn't use nvme_complete_request() in the iterator routine.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18 17:37 ` James Smart
@ 2019-03-19  1:06 ` Ming Lei
  2019-03-19  3:37 ` James Smart
  2019-03-19  1:31 ` Ming Lei
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-19 1:06 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>
>
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> >
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > 	cancel_request():
> > 		mark the request as abort
> > 		blk_mq_complete_request(req);
> > 4) destroy real hw queues
> >
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> >
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
>
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
>
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.

Yeah, I saw the FC code, it is supposed to address the asynchronous
completion of blk_mq_complete_request() in error handler.

Also I think it is always the correct thing to abort requests
synchronously in error handler, isn't it?

thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  1:06 ` Ming Lei
@ 2019-03-19  3:37 ` James Smart
  2019-03-19  3:50 ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: James Smart @ 2019-03-19 3:37 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On 3/18/2019 6:06 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> 	cancel_request():
>>> 		mark the request as abort
>>> 		blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> Yeah, I saw the FC code, it is supposed to address the asynchronous
> completion of blk_mq_complete_request() in error handler.
>
> Also I think it is always the correct thing to abort requests
> synchronously in error handler, isn't it?
>

not sure I fully follow you, but if you're asking shouldn't it always be
synchronous - why would that be the case ? I really don't want a blocking
thread that could block for several seconds on a single io to complete.
The controller has changed state and the queues frozen which should have
been sufficient - but bottom-end io can still complete at any time.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  3:37 ` James Smart
@ 2019-03-19  3:50 ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-19 3:50 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 08:37:35PM -0700, James Smart wrote:
>
>
> On 3/18/2019 6:06 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > >
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > >
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > 	cancel_request():
> > > > 		mark the request as abort
> > > > 		blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > >
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > >
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > >
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > >
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > Yeah, I saw the FC code, it is supposed to address the asynchronous
> > completion of blk_mq_complete_request() in error handler.
> >
> > Also I think it is always the correct thing to abort requests
> > synchronously in error handler, isn't it?
>
> not sure I fully follow you, but if you're asking shouldn't it always be
> synchronous - why would that be the case ? I really don't want a blocking
> thread that could block for several seconds on a single io to complete. The

We are talking about the error handler, in which all in-flight requests
are simply aborted via blk_mq_tagset_busy_iter(nvme_cancel_request, ...),
and there isn't any waiting for a single io to complete.

nvme_cancel_request() basically re-queues the in-flight request to
blk-mq's queues, and the time is pretty short, and I guess
blk_mq_complete_request_sync() should be quicker than
blk_mq_complete_request() in this situation.

> controller has changed state and the queues frozen which should have been
> sufficient - but bottom-end io can still complete at any time.

Queues have been quiesced or stopped for recovery, and queue freezing
requires waiting for completion of all in-flight requests, so a new IO
deadlock would be introduced...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
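The ordering argued for in this exchange (abort every in-flight request
inline, then destroy the hardware queue) can be sketched as a userspace
analogue in plain C. All struct and function names below are illustrative
stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the blk-mq/NVMe objects (not kernel API). */
struct hw_queue { bool alive; };
struct request  { struct hw_queue *q; bool completed; };

/* The completion handler touches per-queue resources, just as
 * nvme_rdma_complete_rq() touches the ib queue pair; it fails if the
 * queue is already gone (the use-after-free analogue). */
static bool complete_rq(struct request *rq)
{
    if (!rq->q->alive)
        return false;          /* raced with queue teardown */
    rq->completed = true;
    return true;
}

/* Synchronous cancel: the handler has finished by the time this returns. */
static bool cancel_request_sync(struct request *rq)
{
    return complete_rq(rq);
}

/* Error-handler teardown: step 3 cancels every in-flight request, step 4
 * destroys the real hw queue.  Because each cancel completes inline,
 * step 4 can never overtake a completion from step 3. */
static bool teardown(struct hw_queue *q, struct request *rqs, int n)
{
    bool ok = true;
    for (int i = 0; i < n; i++)        /* step 3: iterate busy tags */
        ok &= cancel_request_sync(&rqs[i]);
    q->alive = false;                  /* step 4: destroy hw queue */
    return ok;
}
```

With an asynchronous cancel, the handler could instead run after
`q->alive` is cleared, which is the shape of the reported use-after-free
on the RDMA queue pair.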
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18 17:37 ` James Smart
  2019-03-19  1:06 ` Ming Lei
@ 2019-03-19  1:31 ` Ming Lei
  2019-03-19  4:04 ` James Smart
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-19 1:31 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>
>
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> >
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > 	cancel_request():
> > 		mark the request as abort
> > 		blk_mq_complete_request(req);
> > 4) destroy real hw queues
> >
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> >
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
>
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
>
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.

Looks like FC has done the sync already; see nvme_fc_delete_association():

	...
	/* wait for all io that had to be aborted */
	spin_lock_irq(&ctrl->lock);
	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
	ctrl->flags &= ~FCCTRL_TERMIO;
	spin_unlock_irq(&ctrl->lock);

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  1:31 ` Ming Lei
@ 2019-03-19  4:04 ` James Smart
  2019-03-19  4:28 ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: James Smart @ 2019-03-19 4:04 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On 3/18/2019 6:31 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> 	cancel_request():
>>> 		mark the request as abort
>>> 		blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
>
> Looks FC has done the sync already, see nvme_fc_delete_association():
>
> 	...
> 	/* wait for all io that had to be aborted */
> 	spin_lock_irq(&ctrl->lock);
> 	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> 	ctrl->flags &= ~FCCTRL_TERMIO;
> 	spin_unlock_irq(&ctrl->lock);

yes - but the iterator started a lot of the back end io terminating in
parallel. So waiting on many happening in parallel is better than waiting
1 at a time. Even so, I've always disliked this wait and would have
preferred to exit the thread with something monitoring the completions
re-queuing a work thread to finish.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  4:04 ` James Smart
@ 2019-03-19  4:28 ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-19 4:28 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 09:04:37PM -0700, James Smart wrote:
>
>
> On 3/18/2019 6:31 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > >
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > >
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > 	cancel_request():
> > > > 		mark the request as abort
> > > > 		blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > >
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > >
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > >
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > >
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > >
> > Looks FC has done the sync already, see nvme_fc_delete_association():
> >
> > 	...
> > 	/* wait for all io that had to be aborted */
> > 	spin_lock_irq(&ctrl->lock);
> > 	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> > 	ctrl->flags &= ~FCCTRL_TERMIO;
> > 	spin_unlock_irq(&ctrl->lock);
>
> yes - but the iterator started a lot of the back end io terminating in
> parallel. So waiting on many happening in parallel is better than waiting 1
> at a time.

OK, that is FC's sync, not related to this patch.

> Even so, I've always disliked this wait and would have
> preferred to exit the thread with something monitoring the completions
> re-queuing a work thread to finish.

Then I guess you may like this patch, given it actually avoids the
potential wait, :-)

What the patch does is to convert the remote completion (#1) into local
completion (#2):

1) previously one request may be completed remotely by
blk_mq_complete_request():

	rq->csd.func = __blk_mq_complete_request_remote;
	rq->csd.info = rq;
	rq->csd.flags = 0;
	smp_call_function_single_async(ctx->cpu, &rq->csd);

2) this patch changes the remote completion into local completion via
blk_mq_complete_request_sync(), so all in-flight requests can be aborted
before destroying the queue:

	q->mq_ops->complete(rq);

As I mentioned in another email, there isn't any waiting for aborting a
request; nvme_cancel_request() simply requeues the request to blk-mq in
this situation.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
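The remote-versus-local distinction Ming describes can be mimicked in
plain C, with a deferred-call list standing in for the IPI raised by
smp_call_function_single_async(). Again, every name below is an
illustrative stand-in, not kernel API:

```c
#include <assert.h>

struct request { int done; };

static void complete_handler(struct request *rq)
{
    rq->done = 1;                      /* the driver's ->complete(rq) */
}

/* "Remote" completion (#1): park the request on a deferred list, standing
 * in for smp_call_function_single_async() raising an IPI on another CPU;
 * the caller gets control back before the handler has run. */
#define MAX_DEFERRED 16
static struct request *deferred[MAX_DEFERRED];
static int n_deferred;

static void complete_request_async(struct request *rq)
{
    deferred[n_deferred++] = rq;
}

static void run_deferred(void)         /* the "IPI" finally arriving */
{
    for (int i = 0; i < n_deferred; i++)
        complete_handler(deferred[i]);
    n_deferred = 0;
}

/* "Local" completion (#2): run the handler inline, like
 * q->mq_ops->complete(rq) in the sync variant; on return the request is
 * guaranteed completed, so queue teardown may safely follow. */
static void complete_request_sync(struct request *rq)
{
    complete_handler(rq);
}
```

The gap between `complete_request_async()` returning and `run_deferred()`
executing is the window in which queue teardown could race with the
completion handler; the synchronous variant closes that window by
construction.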
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  4:09 ` Bart Van Assche
  2019-03-18 17:37 ` James Smart
@ 2019-03-27  8:30 ` Christoph Hellwig
  2 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-03-27 8:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

Please make this an EXPORT_SYMBOL_GPL as for all new low-level blk-mq
exports.

Otherwise looks fine modulo any minor documentation nitpicks that I
might have seen in the thread:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 2/2] nvme: cancel request synchronously
  2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
@ 2019-03-18  3:29 ` Ming Lei
  2019-03-27  8:30 ` Christoph Hellwig
  2019-03-27  2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
  2 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme

nvme_cancel_request() is used in the error handler, and it is always
safe to cancel a request synchronously; doing so avoids a possible race
in which the request may be completed after the real hw queue has been
destroyed.

One such issue was reported by our customer on NVMe RDMA, in which a
freed ib queue pair may be used in nvme_rdma_complete_rq().

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 470601980794..2c43e12b70af 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -288,7 +288,7 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved)
 				"Cancelling I/O %d", req->tag);
 
 	nvme_req(req)->status = NVME_SC_ABORT_REQ;
-	blk_mq_complete_request(req);
+	blk_mq_complete_request_sync(req);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 2/2] nvme: cancel request synchronously
  2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
@ 2019-03-27  8:30 ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-03-27 8:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 0/2] blk-mq/nvme: cancel request synchronously
  2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
@ 2019-03-27  2:06 ` Ming Lei
  2 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-27 2:06 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 11:29:48AM +0800, Ming Lei wrote:
> Hi,
>
> This patchset introduces blk_mq_complete_request_sync() for canceling
> request synchronously in error handler context, then one race between
> completing request and destroying contoller/queues can be fixed.
>
>
> Ming Lei (2):
>   blk-mq: introduce blk_mq_complete_request_sync()
>   nvme: cancel request synchronously
>
>  block/blk-mq.c           | 20 ++++++++++++++++----
>  drivers/nvme/host/core.c |  2 +-
>  include/linux/blk-mq.h   |  1 +
>  3 files changed, 18 insertions(+), 5 deletions(-)
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: linux-nvme@lists.infradead.org
>
> --
> 2.9.5
>

Hi Jens and Christoph,

These two simple patches do fix a kernel oops seen during "Link flap /
Failover testing an NVMe over RDMA connection" by our customer. No one
seems to object to this fix, so could you consider it for v5.1 if you
are fine with it?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2019-03-27  8:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
2019-03-18  4:09   ` Bart Van Assche
2019-03-18  7:38     ` Ming Lei
2019-03-18 15:04       ` Bart Van Assche
2019-03-18 15:16         ` Ming Lei
2019-03-18 15:49           ` Bart Van Assche
2019-03-18 16:06             ` Ming Lei
2019-03-21  0:47               ` Sagi Grimberg
2019-03-21  1:39                 ` Ming Lei
2019-03-21  2:04                   ` Sagi Grimberg
2019-03-21  2:32                     ` Ming Lei
2019-03-21 21:40                       ` Sagi Grimberg
2019-03-27  8:27                         ` Christoph Hellwig
2019-03-21  2:15               ` Bart Van Assche
2019-03-21  2:13       ` Sagi Grimberg
2019-03-18 14:40     ` Keith Busch
2019-03-18 17:30     ` James Smart
2019-03-18 17:37   ` James Smart
2019-03-19  1:06     ` Ming Lei
2019-03-19  3:37       ` James Smart
2019-03-19  3:50         ` Ming Lei
2019-03-19  1:31     ` Ming Lei
2019-03-19  4:04       ` James Smart
2019-03-19  4:28         ` Ming Lei
2019-03-27  8:30   ` Christoph Hellwig
2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
2019-03-27  8:30   ` Christoph Hellwig
2019-03-27  2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).