* [PATCH 0/2] blk-mq/nvme: cancel request synchronously
@ 2019-03-18 3:29 Ming Lei
2019-03-18 3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme
Hi,
This patchset introduces blk_mq_complete_request_sync() for canceling
requests synchronously in error handler context, so that a race between
completing a request and destroying the controller/queues can be fixed.
Ming Lei (2):
blk-mq: introduce blk_mq_complete_request_sync()
nvme: cancel request synchronously
block/blk-mq.c | 20 ++++++++++++++++----
drivers/nvme/host/core.c | 2 +-
include/linux/blk-mq.h | 1 +
3 files changed, 18 insertions(+), 5 deletions(-)
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
--
2.9.5
^ permalink raw reply [flat|nested] 29+ messages in thread

* [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
2019-03-18 3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
@ 2019-03-18 3:29 ` Ming Lei
2019-03-18 4:09 ` Bart Van Assche
` (2 more replies)
2019-03-18 3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
2019-03-27 2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
2 siblings, 3 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme

NVMe's error handler follows the typical steps for tearing down hardware:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
   blk_mq_tagset_busy_iter(tags, cancel_request, ...)
   cancel_request():
   	mark the request as abort
   	blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be a race between #3 and #4, because
blk_mq_complete_request() actually completes the request asynchronously.

This patch introduces blk_mq_complete_request_sync() for fixing the
above race.

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 20 ++++++++++++++++----
 include/linux/blk-mq.h |  1 +
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9c181603cbd..8f925ac0b26d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -569,7 +569,7 @@ static void __blk_mq_complete_request_remote(void *data)
 	q->mq_ops->complete(rq);
 }
 
-static void __blk_mq_complete_request(struct request *rq)
+static void __blk_mq_complete_request(struct request *rq, bool sync)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
@@ -586,7 +586,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	 * So complete IO reqeust in softirq context in case of single queue
 	 * for not degrading IO performance by irqsoff latency.
 	 */
-	if (q->nr_hw_queues == 1) {
+	if (q->nr_hw_queues == 1 && !sync) {
 		__blk_complete_request(rq);
 		return;
 	}
@@ -594,8 +594,11 @@ static void __blk_mq_complete_request(struct request *rq)
 	/*
 	 * For a polled request, always complete locallly, it's pointless
 	 * to redirect the completion.
+	 *
+	 * If driver requires to complete the request synchronously,
+	 * complete it locally, and it is usually done in error handler.
 	 */
-	if ((rq->cmd_flags & REQ_HIPRI) ||
+	if ((rq->cmd_flags & REQ_HIPRI) || sync ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) {
 		q->mq_ops->complete(rq);
 		return;
@@ -648,11 +651,20 @@ bool blk_mq_complete_request(struct request *rq)
 {
 	if (unlikely(blk_should_fake_timeout(rq->q)))
 		return false;
-	__blk_mq_complete_request(rq);
+	__blk_mq_complete_request(rq, false);
 	return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
+bool blk_mq_complete_request_sync(struct request *rq)
+{
+	if (unlikely(blk_should_fake_timeout(rq->q)))
+		return false;
+	__blk_mq_complete_request(rq, true);
+	return true;
+}
+EXPORT_SYMBOL(blk_mq_complete_request_sync);
+
 int blk_mq_request_started(struct request *rq)
 {
 	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b0c814bcc7e3..6a514e5136f4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -305,6 +305,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 bool blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
 		struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.9.5

^ permalink raw reply related [flat|nested] 29+ messages in thread
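The race between steps #3 and #4 can be sketched in isolation. The following is a hypothetical userspace C model (these names and structures are invented for illustration and exist nowhere in the kernel); it only mimics the distinction between a completion that is parked for another context and one that runs before the cancel path returns:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Invented miniature model of the two completion paths.  In the real
 * code, blk_mq_complete_request() may defer ->complete() to a softirq
 * or remote CPU, so teardown (step #4) can run first; the _sync
 * variant runs ->complete() before returning.
 */
struct request {
	bool completed;
	struct request *next;
};

struct queue {
	struct request *deferred;  /* completions parked for another context */
	bool hw_alive;             /* stands in for the real hw queues */
};

/* Async path: only parks the completion for a later context. */
static void complete_async(struct queue *q, struct request *rq)
{
	rq->next = q->deferred;
	q->deferred = rq;
}

/* Sync path: the completion has run by the time this returns. */
static void complete_sync(struct queue *q, struct request *rq)
{
	(void)q;              /* real code would run q->mq_ops->complete(rq) */
	rq->completed = true;
}

/*
 * Later context: run parked completions; count any that ran after the
 * hw queues were torn down (the race the patch fixes).
 */
static int flush_deferred(struct queue *q)
{
	int after_teardown = 0;
	struct request *rq;

	for (rq = q->deferred; rq; rq = rq->next) {
		if (!q->hw_alive)
			after_teardown++;
		rq->completed = true;
	}
	q->deferred = NULL;
	return after_teardown;
}
```

With complete_async(), tearing down the hw resources before the parked completion runs reproduces the race; with complete_sync(), the request is already completed when step #4 starts, so nothing runs against destroyed queues.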
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei @ 2019-03-18 4:09 ` Bart Van Assche 2019-03-18 7:38 ` Ming Lei ` (2 more replies) 2019-03-18 17:37 ` James Smart 2019-03-27 8:30 ` Christoph Hellwig 2 siblings, 3 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 4:09 UTC (permalink / raw) To: Ming Lei, Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-nvme On 3/17/19 8:29 PM, Ming Lei wrote: > In NVMe's error handler, follows the typical steps for tearing down > hardware: > > 1) stop blk_mq hw queues > 2) stop the real hw queues > 3) cancel in-flight requests via > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > cancel_request(): > mark the request as abort > blk_mq_complete_request(req); > 4) destroy real hw queues > > However, there may be race between #3 and #4, because blk_mq_complete_request() > actually completes the request asynchronously. > > This patch introduces blk_mq_complete_request_sync() for fixing the > above race. Other block drivers wait until outstanding requests have completed by calling blk_cleanup_queue() before hardware queues are destroyed. Why can't the NVMe driver follow that approach? Thanks, Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 4:09 ` Bart Van Assche @ 2019-03-18 7:38 ` Ming Lei 2019-03-18 15:04 ` Bart Van Assche 2019-03-21 2:13 ` Sagi Grimberg 2019-03-18 14:40 ` Keith Busch 2019-03-18 17:30 ` James Smart 2 siblings, 2 replies; 29+ messages in thread From: Ming Lei @ 2019-03-18 7:38 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > On 3/17/19 8:29 PM, Ming Lei wrote: > > In NVMe's error handler, follows the typical steps for tearing down > > hardware: > > > > 1) stop blk_mq hw queues > > 2) stop the real hw queues > > 3) cancel in-flight requests via > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > cancel_request(): > > mark the request as abort > > blk_mq_complete_request(req); > > 4) destroy real hw queues > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > actually completes the request asynchronously. > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > above race. > > Other block drivers wait until outstanding requests have completed by > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > the NVMe driver follow that approach? The tearing down of controller can be done in error handler, in which the request queues may not be cleaned up, almost all kinds of NVMe controller's error handling follows the above steps, such as: nvme_rdma_error_recovery_work() ->nvme_rdma_teardown_io_queues() nvme_timeout() ->nvme_dev_disable Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 7:38 ` Ming Lei @ 2019-03-18 15:04 ` Bart Van Assche 2019-03-18 15:16 ` Ming Lei 2019-03-21 2:13 ` Sagi Grimberg 1 sibling, 1 reply; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 15:04 UTC (permalink / raw) To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote: > On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > > On 3/17/19 8:29 PM, Ming Lei wrote: > > > In NVMe's error handler, follows the typical steps for tearing down > > > hardware: > > > > > > 1) stop blk_mq hw queues > > > 2) stop the real hw queues > > > 3) cancel in-flight requests via > > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > > cancel_request(): > > > mark the request as abort > > > blk_mq_complete_request(req); > > > 4) destroy real hw queues > > > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > > actually completes the request asynchronously. > > > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > > above race. > > > > Other block drivers wait until outstanding requests have completed by > > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > > the NVMe driver follow that approach? > > The tearing down of controller can be done in error handler, in which > the request queues may not be cleaned up, almost all kinds of NVMe > controller's error handling follows the above steps, such as: > > nvme_rdma_error_recovery_work() > ->nvme_rdma_teardown_io_queues() > > nvme_timeout() > ->nvme_dev_disable Hi Ming, This makes me wonder whether the current design of the NVMe core is the best design we can come up with? The structure of e.g. the SRP initiator and target drivers is similar to the NVMeOF drivers. However, there is no need in the SRP initiator driver to terminate requests synchronously. 
Is this due to differences in the error handling approaches in the SCSI and NVMe core drivers? Thanks, Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:04 ` Bart Van Assche @ 2019-03-18 15:16 ` Ming Lei 2019-03-18 15:49 ` Bart Van Assche 0 siblings, 1 reply; 29+ messages in thread From: Ming Lei @ 2019-03-18 15:16 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, Mar 18, 2019 at 08:04:55AM -0700, Bart Van Assche wrote: > On Mon, 2019-03-18 at 15:38 +0800, Ming Lei wrote: > > On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > > > On 3/17/19 8:29 PM, Ming Lei wrote: > > > > In NVMe's error handler, follows the typical steps for tearing down > > > > hardware: > > > > > > > > 1) stop blk_mq hw queues > > > > 2) stop the real hw queues > > > > 3) cancel in-flight requests via > > > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > > > cancel_request(): > > > > mark the request as abort > > > > blk_mq_complete_request(req); > > > > 4) destroy real hw queues > > > > > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > > > actually completes the request asynchronously. > > > > > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > > > above race. > > > > > > Other block drivers wait until outstanding requests have completed by > > > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > > > the NVMe driver follow that approach? > > > > The tearing down of controller can be done in error handler, in which > > the request queues may not be cleaned up, almost all kinds of NVMe > > controller's error handling follows the above steps, such as: > > > > nvme_rdma_error_recovery_work() > > ->nvme_rdma_teardown_io_queues() > > > > nvme_timeout() > > ->nvme_dev_disable > > Hi Ming, > > This makes me wonder whether the current design of the NVMe core is the best > design we can come up with? The structure of e.g. the SRP initiator and target > drivers is similar to the NVMeOF drivers. 
However, there is no need in the SRP > initiator driver to terminate requests synchronously. Is this due to I am not familiar with SRP, could you explain what SRP initiator driver will do when the controller is in bad state? Especially about dealing with in-flight IO requests under this situation. > differences in the error handling approaches in the SCSI and NVMe core drivers? As far as I can tell, I don't see obvious design issue in NVMe host drivers, which tries best to recover controller and retries to complete all in-flight IO. Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:16 ` Ming Lei @ 2019-03-18 15:49 ` Bart Van Assche 2019-03-18 16:06 ` Ming Lei 2019-03-21 0:47 ` Sagi Grimberg 0 siblings, 2 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-18 15:49 UTC (permalink / raw) To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote: > I am not familiar with SRP, could you explain what SRP initiator driver > will do when the controller is in bad state? Especially about dealing with > in-flight IO requests under this situation. Hi Ming, Just like the NVMeOF initiator driver, the SRP initiator driver uses an RDMA RC connection for all of its communication over the network. If communication between initiator and target fails the target driver will close the connection or one of the work requests that was posted by the initiator driver will complete with an error status (wc->status != IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will try to reestablish the connection between initiator and target after a certain delay: if (delay > 0) queue_delayed_work(system_long_wq, &rport->reconnect_work, 1UL * delay * HZ); SCSI timeouts may kick the SCSI error handler. That results in calls of the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() terminates all outstanding requests after having disconnected the RDMA RC connection. Disconnecting the RC connection first guarantees that there are no concurrent request completion calls from the regular completion path and from the error handler. Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:49 ` Bart Van Assche @ 2019-03-18 16:06 ` Ming Lei 2019-03-21 0:47 ` Sagi Grimberg 1 sibling, 0 replies; 29+ messages in thread From: Ming Lei @ 2019-03-18 16:06 UTC (permalink / raw) To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Mon, Mar 18, 2019 at 08:49:24AM -0700, Bart Van Assche wrote: > On Mon, 2019-03-18 at 23:16 +0800, Ming Lei wrote: > > I am not familiar with SRP, could you explain what SRP initiator driver > > will do when the controller is in bad state? Especially about dealing with > > in-flight IO requests under this situation. > > Hi Ming, > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > RDMA RC connection for all of its communication over the network. If > communication between initiator and target fails the target driver will > close the connection or one of the work requests that was posted by the > initiator driver will complete with an error status (wc->status != > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > try to reestablish the connection between initiator and target after a > certain delay: > > if (delay > 0) > queue_delayed_work(system_long_wq, &rport->reconnect_work, > 1UL * delay * HZ); > > SCSI timeouts may kick the SCSI error handler. That results in calls of > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > terminates all outstanding requests after having disconnected the RDMA RC > connection. Looks the approach of NVMe's error handler is basically similar with above. And NVMe just uses 'blk_mq_tagset_busy_iter(nvme_cancel_request)' to abort in-flight requests, and I guess SCSI FC may use driver's approach to do that. > Disconnecting the RC connection first guarantees that there > are no concurrent request completion calls from the regular completion > path and from the error handler. 
It looks like guaranteeing no concurrent request completion requires a driver-specific implementation. However, this patch provides one simple approach for NVMe, so no driver-specific sync mechanism is needed.

Thanks,
Ming

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 15:49 ` Bart Van Assche 2019-03-18 16:06 ` Ming Lei @ 2019-03-21 0:47 ` Sagi Grimberg 2019-03-21 1:39 ` Ming Lei 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 2 replies; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 0:47 UTC (permalink / raw) To: Bart Van Assche, Ming Lei Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme > Hi Ming, > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > RDMA RC connection for all of its communication over the network. If > communication between initiator and target fails the target driver will > close the connection or one of the work requests that was posted by the > initiator driver will complete with an error status (wc->status != > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > try to reestablish the connection between initiator and target after a > certain delay: > > if (delay > 0) > queue_delayed_work(system_long_wq, &rport->reconnect_work, > 1UL * delay * HZ); > > SCSI timeouts may kick the SCSI error handler. That results in calls of > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > terminates all outstanding requests after having disconnected the RDMA RC > connection. Disconnecting the RC connection first guarantees that there > are no concurrent request completion calls from the regular completion > path and from the error handler. Hi Bart, If I understand the race correctly, its not between the requests completion and the queue pairs removal nor the timeout handler necessarily, but rather it is between the async requests completion and the tagset deallocation. Think of surprise removal (or disconnect) during I/O, drivers usually stop/quiesce/freeze the queues, terminate/abort inflight I/Os and then teardown the hw queues and the tagset. IIRC, the same race holds for srp if this happens during I/O: 1. 
srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> __rport_fail_io_fast() 2. complete all I/Os (async remotely via smp) Then continue.. 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() What is preventing (3) from happening before (2) if its async? I would think that scsi drivers need the exact same thing... ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 0:47 ` Sagi Grimberg @ 2019-03-21 1:39 ` Ming Lei 2019-03-21 2:04 ` Sagi Grimberg 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 1 reply; 29+ messages in thread From: Ming Lei @ 2019-03-21 1:39 UTC (permalink / raw) To: Sagi Grimberg Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Wed, Mar 20, 2019 at 05:47:01PM -0700, Sagi Grimberg wrote: > > > Hi Ming, > > > > Just like the NVMeOF initiator driver, the SRP initiator driver uses an > > RDMA RC connection for all of its communication over the network. If > > communication between initiator and target fails the target driver will > > close the connection or one of the work requests that was posted by the > > initiator driver will complete with an error status (wc->status != > > IB_WC_SUCCESS). In the latter case the function srp_handle_qp_err() will > > try to reestablish the connection between initiator and target after a > > certain delay: > > > > if (delay > 0) > > queue_delayed_work(system_long_wq, &rport->reconnect_work, > > 1UL * delay * HZ); > > > > SCSI timeouts may kick the SCSI error handler. That results in calls of > > the srp_reset_device() and/or srp_reset_host() functions. srp_reset_host() > > terminates all outstanding requests after having disconnected the RDMA RC > > connection. Disconnecting the RC connection first guarantees that there > > are no concurrent request completion calls from the regular completion > > path and from the error handler. > > Hi Bart, > > If I understand the race correctly, its not between the requests > completion and the queue pairs removal nor the timeout handler > necessarily, but rather it is between the async requests completion and > the tagset deallocation. > > Think of surprise removal (or disconnect) during I/O, drivers > usually stop/quiesce/freeze the queues, terminate/abort inflight > I/Os and then teardown the hw queues and the tagset. 
> > IIRC, the same race holds for srp if this happens during I/O: > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> > __rport_fail_io_fast() > > 2. complete all I/Os (async remotely via smp) > > Then continue.. > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() > > What is preventing (3) from happening before (2) if its async? I would > think that scsi drivers need the exact same thing... blk_cleanup_queue() will do that, but it can't be used in device recovery obviously. BTW, blk_mq_complete_request_sync() is a bit misleading, maybe blk_mq_complete_request_locally() is better. Thanks, Ming ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 1:39 ` Ming Lei @ 2019-03-21 2:04 ` Sagi Grimberg 2019-03-21 2:32 ` Ming Lei 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 2:04 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig >> Hi Bart, >> >> If I understand the race correctly, its not between the requests >> completion and the queue pairs removal nor the timeout handler >> necessarily, but rather it is between the async requests completion and >> the tagset deallocation. >> >> Think of surprise removal (or disconnect) during I/O, drivers >> usually stop/quiesce/freeze the queues, terminate/abort inflight >> I/Os and then teardown the hw queues and the tagset. >> >> IIRC, the same race holds for srp if this happens during I/O: >> 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() -> >> __rport_fail_io_fast() >> >> 2. complete all I/Os (async remotely via smp) >> >> Then continue.. >> >> 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() >> >> What is preventing (3) from happening before (2) if its async? I would >> think that scsi drivers need the exact same thing... > > blk_cleanup_queue() will do that, but it can't be used in device recovery > obviously. But in device recovery we never free the tagset... I might be missing the race here then... > BTW, blk_mq_complete_request_sync() is a bit misleading, maybe > blk_mq_complete_request_locally() is better. Not really... ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
2019-03-21 2:04 ` Sagi Grimberg
@ 2019-03-21 2:32 ` Ming Lei
2019-03-21 21:40 ` Sagi Grimberg
0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-21 2:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig

On Wed, Mar 20, 2019 at 07:04:09PM -0700, Sagi Grimberg wrote:
> > > Hi Bart,
> > >
> > > If I understand the race correctly, its not between the requests
> > > completion and the queue pairs removal nor the timeout handler
> > > necessarily, but rather it is between the async requests completion and
> > > the tagset deallocation.
> > >
> > > Think of surprise removal (or disconnect) during I/O, drivers
> > > usually stop/quiesce/freeze the queues, terminate/abort inflight
> > > I/Os and then teardown the hw queues and the tagset.
> > >
> > > IIRC, the same race holds for srp if this happens during I/O:
> > > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() ->
> > > __rport_fail_io_fast()
> > >
> > > 2. complete all I/Os (async remotely via smp)
> > >
> > > Then continue..
> > >
> > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags()
> > >
> > > What is preventing (3) from happening before (2) if its async? I would
> > > think that scsi drivers need the exact same thing...
> >
> > blk_cleanup_queue() will do that, but it can't be used in device recovery
> > obviously.
>
> But in device recovery we never free the tagset... I might be missing
> the race here then...

For example:

nvme_rdma_complete_rq
	->nvme_rdma_unmap_data
		->ib_mr_pool_put

But the ib queue pair may have been destroyed by nvme_rdma_destroy_io_queues()
before the request's remote completion:

nvme_rdma_teardown_io_queues:
	nvme_stop_queues(&ctrl->ctrl);
	nvme_rdma_stop_io_queues(ctrl);
	blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
			&ctrl->ctrl);
	if (remove)
		nvme_start_queues(&ctrl->ctrl);
	nvme_rdma_destroy_io_queues(ctrl, remove);

> > BTW, blk_mq_complete_request_sync() is a bit misleading, maybe
> > blk_mq_complete_request_locally() is better.
>
> Not really...

Naming is always the hard part...

Thanks,
Ming

^ permalink raw reply [flat|nested] 29+ messages in thread
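The concrete nvme_rdma case above can be modeled the same way: treat ib_mr_pool_put() as "return an MR to a pool" and nvme_rdma_destroy_io_queues() as "free the pool". Below is a hypothetical userspace sketch (all names invented; it is not kernel code) showing why a cancel that merely schedules the completion leaves a use-after-free window, while a synchronous cancel does not:

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Invented model: the completion body returns an MR to a pool (as
 * nvme_rdma_complete_rq eventually does via ib_mr_pool_put), and
 * destroy_io_queues() frees that pool.
 */
struct mr_pool {
	int free_mrs;
};

struct rdma_queue {
	struct mr_pool *pool;     /* freed by destroy_io_queues() */
	bool completion_pending;  /* a scheduled, not-yet-run completion */
};

/* Completion body: hands the request's MR back to the pool. */
static bool run_completion(struct rdma_queue *q)
{
	if (!q->pool)
		return false;     /* would be a use-after-free in real code */
	q->pool->free_mrs++;
	q->completion_pending = false;
	return true;
}

/* Cancel via an async completion: only schedules the work. */
static void cancel_async(struct rdma_queue *q)
{
	q->completion_pending = true;
}

/* Cancel via a sync completion: the completion has run on return. */
static bool cancel_sync(struct rdma_queue *q)
{
	return run_completion(q);
}

static void destroy_io_queues(struct rdma_queue *q)
{
	free(q->pool);
	q->pool = NULL;
}
```

In the teardown ordering quoted above, blk_mq_tagset_busy_iter() plays the role of the cancel step and nvme_rdma_destroy_io_queues() the role of destroy_io_queues(): if the cancel only schedules the completion, destroy can run first.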
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 2:32 ` Ming Lei @ 2019-03-21 21:40 ` Sagi Grimberg 2019-03-27 8:27 ` Christoph Hellwig 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 21:40 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig > For example, > > nvme_rdma_complete_rq > ->nvme_rdma_unmap_data > ->ib_mr_pool_put > > But the ib queue pair may has been destroyed by nvme_rdma_destroy_io_queues() > before request's remote completion. > > nvme_rdma_teardown_io_queues: > nvme_stop_queues(&ctrl->ctrl); > nvme_rdma_stop_io_queues(ctrl); > blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, > &ctrl->ctrl); > if (remove) > nvme_start_queues(&ctrl->ctrl); > nvme_rdma_destroy_io_queues(ctrl, remove); > Yea, you're right. I'm fine with this patchset. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 21:40 ` Sagi Grimberg @ 2019-03-27 8:27 ` Christoph Hellwig 0 siblings, 0 replies; 29+ messages in thread From: Christoph Hellwig @ 2019-03-27 8:27 UTC (permalink / raw) To: Sagi Grimberg Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche, linux-nvme, Christoph Hellwig On Thu, Mar 21, 2019 at 02:40:51PM -0700, Sagi Grimberg wrote: >> nvme_rdma_teardown_io_queues: >> nvme_stop_queues(&ctrl->ctrl); >> nvme_rdma_stop_io_queues(ctrl); >> blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, >> &ctrl->ctrl); >> if (remove) >> nvme_start_queues(&ctrl->ctrl); >> nvme_rdma_destroy_io_queues(ctrl, remove); >> > > Yea, you're right. I'm fine with this patchset. Is this an official Reviewed-by? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-21 0:47 ` Sagi Grimberg 2019-03-21 1:39 ` Ming Lei @ 2019-03-21 2:15 ` Bart Van Assche 1 sibling, 0 replies; 29+ messages in thread From: Bart Van Assche @ 2019-03-21 2:15 UTC (permalink / raw) To: Sagi Grimberg, Ming Lei Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On 3/20/19 5:47 PM, Sagi Grimberg wrote: > If I understand the race correctly, its not between the requests > completion and the queue pairs removal nor the timeout handler > necessarily, but rather it is between the async requests completion and > the tagset deallocation. > > Think of surprise removal (or disconnect) during I/O, drivers > usually stop/quiesce/freeze the queues, terminate/abort inflight > I/Os and then teardown the hw queues and the tagset. > > IIRC, the same race holds for srp if this happens during I/O: > 1. srp_rport_delete() -> srp_remove_target() -> srp_stop_rport_timers() > -> __rport_fail_io_fast() > > 2. complete all I/Os (async remotely via smp) > > Then continue.. > > 3. scsi_host_put() -> scsi_host_dev_release() -> scsi_mq_destroy_tags() > > What is preventing (3) from happening before (2) if its async? I would > think that scsi drivers need the exact same thing... Hi Sagi, As Ming already replied, I don't think that (3) can happen before (2) in case of the SRP driver. If you have a look at srp_remove_target() you will see that it calls scsi_remove_host(). That function only returns after blk_cleanup_queue() has been called for all associated request queues. As you know that function waits until all outstanding requests have completed. Bart. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 7:38 ` Ming Lei 2019-03-18 15:04 ` Bart Van Assche @ 2019-03-21 2:13 ` Sagi Grimberg 1 sibling, 0 replies; 29+ messages in thread From: Sagi Grimberg @ 2019-03-21 2:13 UTC (permalink / raw) To: Ming Lei, Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme >>> In NVMe's error handler, follows the typical steps for tearing down >>> hardware: >>> >>> 1) stop blk_mq hw queues >>> 2) stop the real hw queues >>> 3) cancel in-flight requests via >>> blk_mq_tagset_busy_iter(tags, cancel_request, ...) >>> cancel_request(): >>> mark the request as abort >>> blk_mq_complete_request(req); >>> 4) destroy real hw queues >>> >>> However, there may be race between #3 and #4, because blk_mq_complete_request() >>> actually completes the request asynchronously. >>> >>> This patch introduces blk_mq_complete_request_sync() for fixing the >>> above race. >> >> Other block drivers wait until outstanding requests have completed by >> calling blk_cleanup_queue() before hardware queues are destroyed. Why can't >> the NVMe driver follow that approach? > > The tearing down of controller can be done in error handler, in which > the request queues may not be cleaned up, almost all kinds of NVMe > controller's error handling follows the above steps, such as: > > nvme_rdma_error_recovery_work() > ->nvme_rdma_teardown_io_queues() Clarification, this happens in its dedicated context, not the timeout or error handler. But I still don't understand the issue here, what is the exact race you are referring to? that we abort/cancel a request and then we complete it again when we destroy the HW queue? If so, that is not the case (at least for rdma/tcp) because nvme_rdma_teardown_io_queues() first flushes the hw queue and then aborts inflight I/O. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() 2019-03-18 4:09 ` Bart Van Assche 2019-03-18 7:38 ` Ming Lei @ 2019-03-18 14:40 ` Keith Busch 2019-03-18 17:30 ` James Smart 2 siblings, 0 replies; 29+ messages in thread From: Keith Busch @ 2019-03-18 14:40 UTC (permalink / raw) To: Bart Van Assche Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme On Sun, Mar 17, 2019 at 09:09:09PM -0700, Bart Van Assche wrote: > On 3/17/19 8:29 PM, Ming Lei wrote: > > In NVMe's error handler, follows the typical steps for tearing down > > hardware: > > > > 1) stop blk_mq hw queues > > 2) stop the real hw queues > > 3) cancel in-flight requests via > > blk_mq_tagset_busy_iter(tags, cancel_request, ...) > > cancel_request(): > > mark the request as abort > > blk_mq_complete_request(req); > > 4) destroy real hw queues > > > > However, there may be race between #3 and #4, because blk_mq_complete_request() > > actually completes the request asynchronously. > > > > This patch introduces blk_mq_complete_request_sync() for fixing the > > above race. > > Other block drivers wait until outstanding requests have completed by > calling blk_cleanup_queue() before hardware queues are destroyed. Why can't > the NVMe driver follow that approach? You can't just wait for an outstanding request indefinitely. We have to safely make forward progress when we've determined it's not going to be completed. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  4:09 ` Bart Van Assche
  2019-03-18  7:38 ` Ming Lei
  2019-03-18 14:40 ` Keith Busch
@ 2019-03-18 17:30 ` James Smart
  2 siblings, 0 replies; 29+ messages in thread
From: James Smart @ 2019-03-18 17:30 UTC (permalink / raw)
To: Bart Van Assche, Ming Lei, Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On 3/17/2019 9:09 PM, Bart Van Assche wrote:
> On 3/17/19 8:29 PM, Ming Lei wrote:
>> In NVMe's error handler, follows the typical steps for tearing down
>> hardware:
>>
>> 1) stop blk_mq hw queues
>> 2) stop the real hw queues
>> 3) cancel in-flight requests via
>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>> 	cancel_request():
>> 		mark the request as abort
>> 		blk_mq_complete_request(req);
>> 4) destroy real hw queues
>>
>> However, there may be race between #3 and #4, because
>> blk_mq_complete_request()
>> actually completes the request asynchronously.
>>
>> This patch introduces blk_mq_complete_request_sync() for fixing the
>> above race.
>
> Other block drivers wait until outstanding requests have completed by
> calling blk_cleanup_queue() before hardware queues are destroyed. Why
> can't the NVMe driver follow that approach?
>

Speaking for the fabrics, not necessarily pci:

The intent of this looping, which happens immediately following an error
being detected, is to cause the termination of the outstanding requests.
Otherwise, the only recourse is to wait for the ios to finish, which they
may never do, or to have their upper-level timeout expire to cause their
termination - thus a very long delay. And one of the commands, on the
admin queue - a different tag set but handled the same - doesn't have a
timeout (the Async Event Reporting command), so it wouldn't necessarily
clear without this looping.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  4:09 ` Bart Van Assche
@ 2019-03-18 17:37 ` James Smart
  2019-03-19  1:06 ` Ming Lei
  2019-03-19  1:31 ` Ming Lei
  2019-03-27  8:30 ` Christoph Hellwig
  2 siblings, 2 replies; 29+ messages in thread
From: James Smart @ 2019-03-18 17:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On 3/17/2019 8:29 PM, Ming Lei wrote:
> In NVMe's error handler, follows the typical steps for tearing down
> hardware:
>
> 1) stop blk_mq hw queues
> 2) stop the real hw queues
> 3) cancel in-flight requests via
> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> 	cancel_request():
> 		mark the request as abort
> 		blk_mq_complete_request(req);
> 4) destroy real hw queues
>
> However, there may be race between #3 and #4, because blk_mq_complete_request()
> actually completes the request asynchronously.
>
> This patch introduces blk_mq_complete_request_sync() for fixing the
> above race.
>

This won't help FC at all. Inherently, the "completion" has to be
asynchronous as line traffic may be required.

e.g. FC doesn't use nvme_complete_request() in the iterator routine.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18 17:37 ` James Smart
@ 2019-03-19  1:06 ` Ming Lei
  2019-03-19  3:37 ` James Smart
  2019-03-19  1:31 ` Ming Lei
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-19 1:06 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>
>
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> >
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > 	cancel_request():
> > 		mark the request as abort
> > 		blk_mq_complete_request(req);
> > 4) destroy real hw queues
> >
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> >
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
>
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
>
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.

Yeah, I saw the FC code, it is supposed to address the asynchronous
completion of blk_mq_complete_request() in error handler.

Also I think it is always the correct thing to abort requests
synchronously in error handler, isn't it?

thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  1:06 ` Ming Lei
@ 2019-03-19  3:37 ` James Smart
  2019-03-19  3:50 ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: James Smart @ 2019-03-19 3:37 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On 3/18/2019 6:06 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> 	cancel_request():
>>> 		mark the request as abort
>>> 		blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> Yeah, I saw the FC code, it is supposed to address the asynchronous
> completion of blk_mq_complete_request() in error handler.
>
> Also I think it is always the correct thing to abort requests
> synchronously in error handler, isn't it?
>

not sure I fully follow you, but if you're asking shouldn't it always be
synchronous - why would that be the case ? I really don't want a blocking
thread that could block for several seconds on a single io to complete.
The controller has changed state and the queues frozen which should have
been sufficient - but bottom-end io can still complete at any time.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  3:37 ` James Smart
@ 2019-03-19  3:50 ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-19 3:50 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 08:37:35PM -0700, James Smart wrote:
>
>
> On 3/18/2019 6:06 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > >
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > >
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > 	cancel_request():
> > > > 		mark the request as abort
> > > > 		blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > >
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > >
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > >
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > >
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > Yeah, I saw the FC code, it is supposed to address the asynchronous
> > completion of blk_mq_complete_request() in error handler.
> >
> > Also I think it is always the correct thing to abort requests
> > synchronously in error handler, isn't it?
>
> not sure I fully follow you, but if you're asking shouldn't it always be
> synchronous - why would that be the case ? I really don't want a blocking
> thread that could block for several seconds on a single io to complete. The

We are talking about the error handler, in which all in-flight requests
are simply aborted via blk_mq_tagset_busy_iter(nvme_cancel_request, ...),
and there isn't any waiting for a single io to complete.

nvme_cancel_request() basically re-queues the in-flight request to
blk-mq's queues, and the time is pretty short, and I guess
blk_mq_complete_request_sync() should be quicker than
blk_mq_complete_request() in this situation.

> controller has changed state and the queues frozen which should have been
> sufficient - but bottom-end io can still complete at any time.

Queues have been quiesced or stopped for recovery, and queue freezing
requires waiting for completion of all in-flight requests, so a new IO
deadlock would be introduced...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
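The ordering argued for in this exchange (abort every in-flight request
inline, then destroy the hardware queue) can be sketched as a userspace
analogue in plain C. All struct and function names below are illustrative
stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the blk-mq/NVMe objects (not kernel API). */
struct hw_queue { bool alive; };
struct request  { struct hw_queue *q; bool completed; };

/* The completion handler touches per-queue resources, just as
 * nvme_rdma_complete_rq() touches the ib queue pair; it fails if the
 * queue is already gone (the use-after-free analogue). */
static bool complete_rq(struct request *rq)
{
    if (!rq->q->alive)
        return false;          /* raced with queue teardown */
    rq->completed = true;
    return true;
}

/* Synchronous cancel: the handler has finished by the time this returns. */
static bool cancel_request_sync(struct request *rq)
{
    return complete_rq(rq);
}

/* Error-handler teardown: step 3 cancels every in-flight request, step 4
 * destroys the real hw queue.  Because each cancel completes inline,
 * step 4 can never overtake a completion from step 3. */
static bool teardown(struct hw_queue *q, struct request *rqs, int n)
{
    bool ok = true;
    for (int i = 0; i < n; i++)        /* step 3: iterate busy tags */
        ok &= cancel_request_sync(&rqs[i]);
    q->alive = false;                  /* step 4: destroy hw queue */
    return ok;
}
```

With an asynchronous cancel, the handler could instead run after
`q->alive` is cleared, which is the shape of the reported use-after-free
on the RDMA queue pair.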
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18 17:37 ` James Smart
  2019-03-19  1:06 ` Ming Lei
@ 2019-03-19  1:31 ` Ming Lei
  2019-03-19  4:04 ` James Smart
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-19 1:31 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>
>
> On 3/17/2019 8:29 PM, Ming Lei wrote:
> > In NVMe's error handler, follows the typical steps for tearing down
> > hardware:
> >
> > 1) stop blk_mq hw queues
> > 2) stop the real hw queues
> > 3) cancel in-flight requests via
> > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > 	cancel_request():
> > 		mark the request as abort
> > 		blk_mq_complete_request(req);
> > 4) destroy real hw queues
> >
> > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > actually completes the request asynchronously.
> >
> > This patch introduces blk_mq_complete_request_sync() for fixing the
> > above race.
>
> This won't help FC at all. Inherently, the "completion" has to be
> asynchronous as line traffic may be required.
>
> e.g. FC doesn't use nvme_complete_request() in the iterator routine.

Looks like FC has done the sync already; see nvme_fc_delete_association():

	...
	/* wait for all io that had to be aborted */
	spin_lock_irq(&ctrl->lock);
	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
	ctrl->flags &= ~FCCTRL_TERMIO;
	spin_unlock_irq(&ctrl->lock);

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  1:31 ` Ming Lei
@ 2019-03-19  4:04 ` James Smart
  2019-03-19  4:28 ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: James Smart @ 2019-03-19 4:04 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On 3/18/2019 6:31 PM, Ming Lei wrote:
> On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
>>
>> On 3/17/2019 8:29 PM, Ming Lei wrote:
>>> In NVMe's error handler, follows the typical steps for tearing down
>>> hardware:
>>>
>>> 1) stop blk_mq hw queues
>>> 2) stop the real hw queues
>>> 3) cancel in-flight requests via
>>> 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
>>> 	cancel_request():
>>> 		mark the request as abort
>>> 		blk_mq_complete_request(req);
>>> 4) destroy real hw queues
>>>
>>> However, there may be race between #3 and #4, because blk_mq_complete_request()
>>> actually completes the request asynchronously.
>>>
>>> This patch introduces blk_mq_complete_request_sync() for fixing the
>>> above race.
>>>
>> This won't help FC at all. Inherently, the "completion" has to be
>> asynchronous as line traffic may be required.
>>
>> e.g. FC doesn't use nvme_complete_request() in the iterator routine.
>
> Looks FC has done the sync already, see nvme_fc_delete_association():
>
> 	...
> 	/* wait for all io that had to be aborted */
> 	spin_lock_irq(&ctrl->lock);
> 	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> 	ctrl->flags &= ~FCCTRL_TERMIO;
> 	spin_unlock_irq(&ctrl->lock);

yes - but the iterator started a lot of the back end io terminating in
parallel. So waiting on many happening in parallel is better than waiting
1 at a time. Even so, I've always disliked this wait and would have
preferred to exit the thread with something monitoring the completions
re-queuing a work thread to finish.

-- james

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-19  4:04 ` James Smart
@ 2019-03-19  4:28 ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-19 4:28 UTC (permalink / raw)
To: James Smart
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 09:04:37PM -0700, James Smart wrote:
>
>
> On 3/18/2019 6:31 PM, Ming Lei wrote:
> > On Mon, Mar 18, 2019 at 10:37:08AM -0700, James Smart wrote:
> > >
> > > On 3/17/2019 8:29 PM, Ming Lei wrote:
> > > > In NVMe's error handler, follows the typical steps for tearing down
> > > > hardware:
> > > >
> > > > 1) stop blk_mq hw queues
> > > > 2) stop the real hw queues
> > > > 3) cancel in-flight requests via
> > > > 	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
> > > > 	cancel_request():
> > > > 		mark the request as abort
> > > > 		blk_mq_complete_request(req);
> > > > 4) destroy real hw queues
> > > >
> > > > However, there may be race between #3 and #4, because blk_mq_complete_request()
> > > > actually completes the request asynchronously.
> > > >
> > > > This patch introduces blk_mq_complete_request_sync() for fixing the
> > > > above race.
> > > >
> > > This won't help FC at all. Inherently, the "completion" has to be
> > > asynchronous as line traffic may be required.
> > >
> > > e.g. FC doesn't use nvme_complete_request() in the iterator routine.
> > >
> > Looks FC has done the sync already, see nvme_fc_delete_association():
> >
> > 	...
> > 	/* wait for all io that had to be aborted */
> > 	spin_lock_irq(&ctrl->lock);
> > 	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> > 	ctrl->flags &= ~FCCTRL_TERMIO;
> > 	spin_unlock_irq(&ctrl->lock);
>
> yes - but the iterator started a lot of the back end io terminating in
> parallel. So waiting on many happening in parallel is better than waiting 1
> at a time.

OK, that is FC's sync, not related to this patch.

> Even so, I've always disliked this wait and would have
> preferred to exit the thread with something monitoring the completions
> re-queuing a work thread to finish.

Then I guess you may like this patch, given it actually avoids the
potential wait, :-)

What the patch does is to convert the remote completion (#1) into local
completion (#2):

1) previously one request may be completed remotely by
blk_mq_complete_request():

	rq->csd.func = __blk_mq_complete_request_remote;
	rq->csd.info = rq;
	rq->csd.flags = 0;
	smp_call_function_single_async(ctx->cpu, &rq->csd);

2) this patch changes the remote completion into local completion via
blk_mq_complete_request_sync(), so all in-flight requests can be aborted
before destroying the queue:

	q->mq_ops->complete(rq);

As I mentioned in another email, there isn't any waiting for aborting a
request; nvme_cancel_request() simply requeues the request to blk-mq in
this situation.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
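The remote-versus-local distinction Ming describes can be mimicked in
plain C, with a deferred-call list standing in for the IPI raised by
smp_call_function_single_async(). Again, every name below is an
illustrative stand-in, not kernel API:

```c
#include <assert.h>

struct request { int done; };

static void complete_handler(struct request *rq)
{
    rq->done = 1;                      /* the driver's ->complete(rq) */
}

/* "Remote" completion (#1): park the request on a deferred list, standing
 * in for smp_call_function_single_async() raising an IPI on another CPU;
 * the caller gets control back before the handler has run. */
#define MAX_DEFERRED 16
static struct request *deferred[MAX_DEFERRED];
static int n_deferred;

static void complete_request_async(struct request *rq)
{
    deferred[n_deferred++] = rq;
}

static void run_deferred(void)         /* the "IPI" finally arriving */
{
    for (int i = 0; i < n_deferred; i++)
        complete_handler(deferred[i]);
    n_deferred = 0;
}

/* "Local" completion (#2): run the handler inline, like
 * q->mq_ops->complete(rq) in the sync variant; on return the request is
 * guaranteed completed, so queue teardown may safely follow. */
static void complete_request_sync(struct request *rq)
{
    complete_handler(rq);
}
```

The gap between `complete_request_async()` returning and `run_deferred()`
executing is the window in which queue teardown could race with the
completion handler; the synchronous variant closes that window by
construction.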
* Re: [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync()
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  4:09 ` Bart Van Assche
  2019-03-18 17:37 ` James Smart
@ 2019-03-27  8:30 ` Christoph Hellwig
  2 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-03-27 8:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

Please make this an EXPORT_SYMBOL_GPL as for all new low-level blk-mq
exports.

Otherwise looks fine modulo any minor documentation nitpicks that I
might have seen in the thread:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 2/2] nvme: cancel request synchronously
  2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
@ 2019-03-18  3:29 ` Ming Lei
  2019-03-27  8:30 ` Christoph Hellwig
  2019-03-27  2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
  2 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-03-18 3:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Ming Lei, Christoph Hellwig, linux-nvme

nvme_cancel_request() is used in the error handler, and it is always
safe to cancel a request synchronously; doing so avoids a possible race
in which the request may be completed after the real hw queue has been
destroyed.

One such issue was reported by our customer on NVMe RDMA, in which a
freed ib queue pair may be used in nvme_rdma_complete_rq().

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 470601980794..2c43e12b70af 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -288,7 +288,7 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved)
 				"Cancelling I/O %d", req->tag);
 
 	nvme_req(req)->status = NVME_SC_ABORT_REQ;
-	blk_mq_complete_request(req);
+	blk_mq_complete_request_sync(req);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 2/2] nvme: cancel request synchronously
  2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
@ 2019-03-27  8:30 ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-03-27 8:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 0/2] blk-mq/nvme: cancel request synchronously
  2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
  2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
  2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
@ 2019-03-27  2:06 ` Ming Lei
  2 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-03-27 2:06 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, linux-nvme

On Mon, Mar 18, 2019 at 11:29:48AM +0800, Ming Lei wrote:
> Hi,
>
> This patchset introduces blk_mq_complete_request_sync() for canceling
> request synchronously in error handler context, then one race between
> completing request and destroying contoller/queues can be fixed.
>
>
> Ming Lei (2):
>   blk-mq: introduce blk_mq_complete_request_sync()
>   nvme: cancel request synchronously
>
>  block/blk-mq.c           | 20 ++++++++++++++++----
>  drivers/nvme/host/core.c |  2 +-
>  include/linux/blk-mq.h   |  1 +
>  3 files changed, 18 insertions(+), 5 deletions(-)
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: linux-nvme@lists.infradead.org
>
> --
> 2.9.5
>

Hi Jens and Christoph,

These two simple patches do fix a kernel oops seen during "Link flap /
Failover testing an NVMe over RDMA connection" by our customer. No one
seems to object to this fix, so could you consider it for v5.1 if you
are fine with it?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2019-03-27  8:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-18  3:29 [PATCH 0/2] blk-mq/nvme: cancel request synchronously Ming Lei
2019-03-18  3:29 ` [PATCH 1/2] blk-mq: introduce blk_mq_complete_request_sync() Ming Lei
2019-03-18  4:09   ` Bart Van Assche
2019-03-18  7:38     ` Ming Lei
2019-03-18 15:04       ` Bart Van Assche
2019-03-18 15:16         ` Ming Lei
2019-03-18 15:49           ` Bart Van Assche
2019-03-18 16:06             ` Ming Lei
2019-03-21  0:47               ` Sagi Grimberg
2019-03-21  1:39                 ` Ming Lei
2019-03-21  2:04                   ` Sagi Grimberg
2019-03-21  2:32                     ` Ming Lei
2019-03-21 21:40                       ` Sagi Grimberg
2019-03-27  8:27                         ` Christoph Hellwig
2019-03-21  2:15               ` Bart Van Assche
2019-03-21  2:13       ` Sagi Grimberg
2019-03-18 14:40     ` Keith Busch
2019-03-18 17:30     ` James Smart
2019-03-18 17:37   ` James Smart
2019-03-19  1:06     ` Ming Lei
2019-03-19  3:37       ` James Smart
2019-03-19  3:50         ` Ming Lei
2019-03-19  1:31     ` Ming Lei
2019-03-19  4:04       ` James Smart
2019-03-19  4:28         ` Ming Lei
2019-03-27  8:30   ` Christoph Hellwig
2019-03-18  3:29 ` [PATCH 2/2] nvme: cancel request synchronously Ming Lei
2019-03-27  8:30   ` Christoph Hellwig
2019-03-27  2:06 ` [PATCH 0/2] blk-mq/nvme: " Ming Lei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).