* [PATCH 5.0 65/95] blk-mq: introduce blk_mq_complete_request_sync()

From: Greg Kroah-Hartman @ 2019-05-09 18:42 UTC

[ Upstream commit 1b8f21b74c3c9c82fce5a751d7aefb7cc0b8d33d ]

NVMe's error handler follows the typical steps of tearing down the
hardware to recover the controller:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
   cancel_request():
	mark the request as abort
	blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be a race between #3 and #4, because
blk_mq_complete_request() may run q->mq_ops->complete(rq) remotely and
asynchronously, and ->complete(rq) may run after #4.

This patch introduces blk_mq_complete_request_sync() to fix the above
race.

Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Bart Van Assche <bvanassche at acm.org>
Cc: James Smart <james.smart at broadcom.com>
Cc: linux-nvme at lists.infradead.org
Reviewed-by: Keith Busch <keith.busch at intel.com>
Reviewed-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
Signed-off-by: Jens Axboe <axboe at kernel.dk>
Signed-off-by: Sasha Levin <sashal at kernel.org>
---
 block/blk-mq.c         | 7 +++++++
 include/linux/blk-mq.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5a2585d69c817..6930c82ab75fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -657,6 +657,13 @@ bool blk_mq_complete_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_complete_request);

+void blk_mq_complete_request_sync(struct request *rq)
+{
+	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
+	rq->q->mq_ops->complete(rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
+
 int blk_mq_request_started(struct request *rq)
 {
 	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0e030f5f76b66..7e092bdac27f6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -306,6 +306,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 bool blk_mq_complete_request(struct request *rq);
+void blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
 			   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
--
2.20.1
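[Editor's sketch] The difference between the two completion paths can be
modeled in plain userspace C. This is only an illustration under stated
assumptions: the struct, the demo handler, and the model_* functions are
invented here and merely mirror the kernel names; the real async path
defers ->complete(rq) via IPI/softirq, which is modeled by simply not
calling the handler.

```c
#include <assert.h>

/* Minimal userspace model -- not kernel code. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct request {
	enum mq_rq_state state;
	void (*complete)(struct request *rq); /* stands in for q->mq_ops->complete */
	int handler_ran;                      /* set by the demo handler below */
};

/* Async path: the handler may run later on another CPU, so the caller
 * cannot assume it has finished -- modeled by not invoking it at all. */
static void model_complete_request(struct request *rq)
{
	rq->state = MQ_RQ_COMPLETE;
	/* ->complete(rq) would be scheduled remotely here */
}

/* Sync path, mirroring the patch: by the time this returns, the
 * driver's completion handler has already run in the caller's context. */
static void model_complete_request_sync(struct request *rq)
{
	rq->state = MQ_RQ_COMPLETE;
	rq->complete(rq);
}

static void demo_handler(struct request *rq)
{
	rq->handler_ran = 1;
}
```

The point of the sync variant is exactly the post-condition the error
handler needs: once it returns, no completion handler can run later
against torn-down hardware.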
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Greg Kroah-Hartman @ 2019-05-09 18:42 UTC

[ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]

nvme_cancel_request() is used in error handler, and it is always
reliable to cancel request synchronously, and avoids possible race
in which request may be completed after real hw queue is destroyed.

One issue is reported by our customer on NVMe RDMA, in which freed ib
queue pair may be used in nvme_rdma_complete_rq().

Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Bart Van Assche <bvanassche at acm.org>
Cc: James Smart <james.smart at broadcom.com>
Cc: linux-nvme at lists.infradead.org
Reviewed-by: Keith Busch <keith.busch at intel.com>
Reviewed-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
Signed-off-by: Jens Axboe <axboe at kernel.dk>
Signed-off-by: Sasha Levin <sashal at kernel.org>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68c0f4fe..4c4413ad3ceb3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -291,7 +291,7 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved)
 				"Cancelling I/O %d", req->tag);

 	nvme_req(req)->status = NVME_SC_ABORT_REQ;
-	blk_mq_complete_request(req);
+	blk_mq_complete_request_sync(req);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
--
2.20.1
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Max Gurtovoy @ 2019-05-21 8:36 UTC

On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
> [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
>
> nvme_cancel_request() is used in error handler, and it is always
> reliable to cancel request synchronously, and avoids possible race
> in which request may be completed after real hw queue is destroyed.

Ming,

If the completion is async in the block layer, can't a "good" request
(not a canceled one..) complete after real HW queue is destroyed?

> One issue is reported by our customer on NVMe RDMA, in which freed ib
> queue pair may be used in nvme_rdma_complete_rq().
>
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Cc: Bart Van Assche <bvanassche at acm.org>
> Cc: James Smart <james.smart at broadcom.com>
> Cc: linux-nvme at lists.infradead.org
> Reviewed-by: Keith Busch <keith.busch at intel.com>
> Reviewed-by: Christoph Hellwig <hch at lst.de>
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> Signed-off-by: Jens Axboe <axboe at kernel.dk>
> Signed-off-by: Sasha Levin <sashal at kernel.org>
> ---
>   drivers/nvme/host/core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68c0f4fe..4c4413ad3ceb3 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -291,7 +291,7 @@ bool nvme_cancel_request(struct request *req, void *data, bool reserved)
>   				"Cancelling I/O %d", req->tag);
>
>   	nvme_req(req)->status = NVME_SC_ABORT_REQ;
> -	blk_mq_complete_request(req);
> +	blk_mq_complete_request_sync(req);
>   	return true;
>   }
>   EXPORT_SYMBOL_GPL(nvme_cancel_request);
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Ming Lei @ 2019-05-21 9:45 UTC

On Tue, May 21, 2019 at 11:36:26AM +0300, Max Gurtovoy wrote:
> On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
> > [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
> >
> > nvme_cancel_request() is used in error handler, and it is always
> > reliable to cancel request synchronously, and avoids possible race
> > in which request may be completed after real hw queue is destroyed.
>
> Ming,
>
> If the completion is async in the block layer, can't a "good" request (not a
> canceled one..) complete after real HW queue is destroyed ?

In theory, it can't.

1) in case of error recovery

It is driver's responsibility to sync normal completion and handling
error. NVMe PCI calls nvme_dev_disable() to shutdown controller, and
there won't be good request any more after nvme_dev_disable() returns.
I am not very familiar with NVMe RDMA code, but nvme_rdma_stop_io_queues()
is supposed to do that for avoiding race with normal completion. Otherwise,
it isn't enough by simply canceling in-flight requests.

2) in case of device removal

blk_cleanup_queue() drains all in-queue requests, so there can't be
such issue.

Thanks,
Ming
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Max Gurtovoy @ 2019-05-21 10:21 UTC

On 5/21/2019 12:45 PM, Ming Lei wrote:
> On Tue, May 21, 2019 at 11:36:26AM +0300, Max Gurtovoy wrote:
>> On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
>>> [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
>>>
>>> nvme_cancel_request() is used in error handler, and it is always
>>> reliable to cancel request synchronously, and avoids possible race
>>> in which request may be completed after real hw queue is destroyed.
>> Ming,
>>
>> If the completion is async in the block layer, can't a "good" request (not a
>> canceled one..) complete after real HW queue is destroyed ?
> In theory, it can't.
>
> 1) in case of error recovery
>
> It is driver's responsibility to sync normal completion and handling
> error. NVMe PCI calls nvme_dev_disable() to shutdown controller, and
> there won't be good request any more after nvme_dev_disable() returns.
> I am not very familiar with NVMe RDMA code, but nvme_rdma_stop_io_queues()
> is supposed to do that for avoiding race with normal completion. Otherwise,
> it isn't enough by simply canceling in-flight requests.

Indeed nvme_rdma_stop_io_queues will guaranty that we won't get anything
from the wire/HCA anymore.

But what happens to IO's that were completed before
"nvme_rdma_stop_io_queues" in async way:

1. nvme_end_request --> blk_mq_complete_request (async)

2. error recovery starts (queues are stopped)

3. block layer calls ops->complete(rq) on rq from step #1

if the blk_mq_quiesce_queue + blk_mq_unquiesce_queue don't sync the requests
from #1, i think it might be problematic..

> 2) in case of device removal
>
> blk_cleanup_queue() drains all in-queue requests, so there can't be
> such issue.
>
> Thanks,
> Ming
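[Editor's sketch] The three-step scenario above can be made concrete with
a small userspace model. Everything here is invented for illustration
(the structs, function names, and the qp_freed flag are not real kernel
or RDMA APIs); it only demonstrates the ordering problem being discussed:
a handler deferred at step 1 runs at step 3, after the resources it needs
were freed at step 2.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the race -- not kernel code. */
struct queue {
	int qp_freed; /* 1 once the "ib queue pair" has been destroyed */
};

struct request {
	struct queue *q;
	void (*deferred_complete)(struct request *rq); /* pending handler */
	int touched_freed_qp;                          /* records the bug */
};

/* step 1: nvme_end_request -> blk_mq_complete_request (async):
 * the completion handler is only queued, not run */
static void async_complete(struct request *rq,
			   void (*handler)(struct request *rq))
{
	rq->deferred_complete = handler;
}

/* the driver's ->complete(), which dereferences queue-pair state */
static void model_rdma_complete_rq(struct request *rq)
{
	rq->touched_freed_qp = rq->q->qp_freed; /* use-after-free if set */
}

/* step 2: error recovery destroys the real hw queue */
static void destroy_hw_queue(struct queue *q)
{
	q->qp_freed = 1;
}

/* step 3: the block layer finally runs the deferred handler */
static void run_deferred(struct request *rq)
{
	if (rq->deferred_complete)
		rq->deferred_complete(rq);
}
```

Running the three steps in that order makes the handler observe freed
queue-pair state, which is exactly the customer-reported symptom the
patch is after.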
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Ming Lei @ 2019-05-21 10:41 UTC

On Tue, May 21, 2019 at 01:21:39PM +0300, Max Gurtovoy wrote:
> On 5/21/2019 12:45 PM, Ming Lei wrote:
> > On Tue, May 21, 2019 at 11:36:26AM +0300, Max Gurtovoy wrote:
> > > On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
> > > > [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
> > > >
> > > > nvme_cancel_request() is used in error handler, and it is always
> > > > reliable to cancel request synchronously, and avoids possible race
> > > > in which request may be completed after real hw queue is destroyed.
> > > Ming,
> > >
> > > If the completion is async in the block layer, can't a "good" request (not a
> > > canceled one..) complete after real HW queue is destroyed ?
> > In theory, it can't.
> >
> > 1) in case of error recovery
> >
> > It is driver's responsibility to sync normal completion and handling
> > error. NVMe PCI calls nvme_dev_disable() to shutdown controller, and
> > there won't be good request any more after nvme_dev_disable() returns.
> > I am not very familiar with NVMe RDMA code, but nvme_rdma_stop_io_queues()
> > is supposed to do that for avoiding race with normal completion. Otherwise,
> > it isn't enough by simply canceling in-flight requests.
>
> Indeed nvme_rdma_stop_io_queues will guaranty that we won't get anything
> from the wire/HCA anymore.
>
> But what happens to IO's that were completed before
> "nvme_rdma_stop_io_queues" in async way:
>
> 1. nvme_end_request --> blk_mq_complete_request (async)
>
> 2. error recovery starts (queues are stopped)
>
> 3. block layer calls ops->complete(rq) on rq from step #1
>
> if the blk_mq_quiesce_queue + blk_mq_unquiesce_queue don't sync the requests
> from #1, i think it might be problematic..

You are right, we might have to wait until there isn't any in-flight request
which is marked as transient MQ_RQ_COMPLETE before destroying hw queue.

Thanks,
Ming
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Max Gurtovoy @ 2019-05-21 11:50 UTC

On 5/21/2019 1:41 PM, Ming Lei wrote:
> On Tue, May 21, 2019 at 01:21:39PM +0300, Max Gurtovoy wrote:
>> On 5/21/2019 12:45 PM, Ming Lei wrote:
>>> On Tue, May 21, 2019 at 11:36:26AM +0300, Max Gurtovoy wrote:
>>>> On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
>>>>> [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
>>>>>
>>>>> nvme_cancel_request() is used in error handler, and it is always
>>>>> reliable to cancel request synchronously, and avoids possible race
>>>>> in which request may be completed after real hw queue is destroyed.
>>>> Ming,
>>>>
>>>> If the completion is async in the block layer, can't a "good" request (not a
>>>> canceled one..) complete after real HW queue is destroyed ?
>>> In theory, it can't.
>>>
>>> 1) in case of error recovery
>>>
>>> It is driver's responsibility to sync normal completion and handling
>>> error. NVMe PCI calls nvme_dev_disable() to shutdown controller, and
>>> there won't be good request any more after nvme_dev_disable() returns.
>>> I am not very familiar with NVMe RDMA code, but nvme_rdma_stop_io_queues()
>>> is supposed to do that for avoiding race with normal completion. Otherwise,
>>> it isn't enough by simply canceling in-flight requests.
>> Indeed nvme_rdma_stop_io_queues will guaranty that we won't get anything
>> from the wire/HCA anymore.
>>
>> But what happens to IO's that were completed before
>> "nvme_rdma_stop_io_queues" in async way:
>>
>> 1. nvme_end_request --> blk_mq_complete_request (async)
>>
>> 2. error recovery starts (queues are stopped)
>>
>> 3. block layer calls ops->complete(rq) on rq from step #1
>>
>> if the blk_mq_quiesce_queue + blk_mq_unquiesce_queue don't sync the requests
>> from #1, i think it might be problematic..
> You are right, we might have to wait until there isn't any in-flight request
> which is marked as transient MQ_RQ_COMPLETE before destroying hw queue.

is there an API in the block layer to guaranty that ?

> Thanks,
> Ming
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Ming Lei @ 2019-05-21 12:49 UTC

On Tue, May 21, 2019 at 02:50:22PM +0300, Max Gurtovoy wrote:
> On 5/21/2019 1:41 PM, Ming Lei wrote:
> > On Tue, May 21, 2019 at 01:21:39PM +0300, Max Gurtovoy wrote:
> > > On 5/21/2019 12:45 PM, Ming Lei wrote:
> > > > On Tue, May 21, 2019 at 11:36:26AM +0300, Max Gurtovoy wrote:
> > > > > On 5/9/2019 9:42 PM, Greg Kroah-Hartman wrote:
> > > > > > [ Upstream commit eb3afb75b57c28599af0dfa03a99579d410749e9 ]
> > > > > >
> > > > > > nvme_cancel_request() is used in error handler, and it is always
> > > > > > reliable to cancel request synchronously, and avoids possible race
> > > > > > in which request may be completed after real hw queue is destroyed.
> > > > > Ming,
> > > > >
> > > > > If the completion is async in the block layer, can't a "good" request (not a
> > > > > canceled one..) complete after real HW queue is destroyed ?
> > > > In theory, it can't.
> > > >
> > > > 1) in case of error recovery
> > > >
> > > > It is driver's responsibility to sync normal completion and handling
> > > > error. NVMe PCI calls nvme_dev_disable() to shutdown controller, and
> > > > there won't be good request any more after nvme_dev_disable() returns.
> > > > I am not very familiar with NVMe RDMA code, but nvme_rdma_stop_io_queues()
> > > > is supposed to do that for avoiding race with normal completion. Otherwise,
> > > > it isn't enough by simply canceling in-flight requests.
> > > Indeed nvme_rdma_stop_io_queues will guaranty that we won't get anything
> > > from the wire/HCA anymore.
> > >
> > > But what happens to IO's that were completed before
> > > "nvme_rdma_stop_io_queues" in async way:
> > >
> > > 1. nvme_end_request --> blk_mq_complete_request (async)
> > >
> > > 2. error recovery starts (queues are stopped)
> > >
> > > 3. block layer calls ops->complete(rq) on rq from step #1
> > >
> > > if the blk_mq_quiesce_queue + blk_mq_unquiesce_queue don't sync the requests
> > > from #1, i think it might be problematic..
> > You are right, we might have to wait until there isn't any in-flight request
> > which is marked as transient MQ_RQ_COMPLETE before destroying hw queue.
>
> is there an API in the block layer to guaranty that ?

So far there isn't yet.

And it can be built easily against blk_mq_tagset_busy_iter(), then called
in delay-spin style for checking if there is any request marked as
MQ_RQ_COMPLETE.

Thanks,
Ming
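[Editor's sketch] The delay-spin drain Ming describes can be sketched in
userspace. This is a model under stated assumptions, not a real block-layer
API: any_rq_completing() stands in for a blk_mq_tagset_busy_iter() pass
over the tagset, and retire_one() stands in for a remote CPU finishing a
pending ->complete() handler; a kernel version would sleep between polls
instead of retiring requests itself.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of draining transient completions -- not kernel code. */
enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* stands in for iterating the tagset with blk_mq_tagset_busy_iter() */
static int any_rq_completing(const enum mq_rq_state *rqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rqs[i] == MQ_RQ_COMPLETE)
			return 1;
	return 0;
}

/* one pending handler finishes; the request becomes reusable (IDLE) */
static void retire_one(enum mq_rq_state *rqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rqs[i] == MQ_RQ_COMPLETE) {
			rqs[i] = MQ_RQ_IDLE;
			return;
		}
}

/* delay-spin drain: once this returns, no request is still in the
 * transient MQ_RQ_COMPLETE state, so destroying hw queues is safe */
static void wait_completing_rqs(enum mq_rq_state *rqs, size_t n)
{
	while (any_rq_completing(rqs, n))
		retire_one(rqs, n); /* a real version would msleep() and re-poll */
}
```

Called between canceling in-flight requests and destroying the real hw
queues, such a drain would close the same window that
blk_mq_complete_request_sync() closes for the canceled requests.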
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Sagi Grimberg @ 2019-05-24 8:15 UTC

>> is there an API in the block layer to guaranty that ?
>
> So far there isn't yet.
>
> And it can be built easily against blk_mq_tagset_busy_iter(), then called
> in delay-spin style for checking if there is any request marked as
> MQ_RQ_COMPLETE.

This will probably make blk_mq_complete_request_sync() redundant
wouldn't it?
* [PATCH 5.0 66/95] nvme: cancel request synchronously

From: Ming Lei @ 2019-05-24 8:23 UTC

On Fri, May 24, 2019 at 01:15:42AM -0700, Sagi Grimberg wrote:
>
> > > is there an API in the block layer to guaranty that ?
> >
> > So far there isn't yet.
> >
> > And it can be built easily against blk_mq_tagset_busy_iter(), then called
> > in delay-spin style for checking if there is any request marked as
> > MQ_RQ_COMPLETE.
>
> This will probably make blk_mq_complete_request_sync() redundant
> wouldn't it?

Suppose there is a new API to drain completed requests,
blk_mq_wait_completing_rq(), which has to be called before canceling
requests. It can be used to wait for the canceled requests too, or we
can keep blk_mq_complete_request_sync() to avoid the wait. Either way
should be fine, IMO.

Thanks,
Ming