* how can one drain MQ request queue ?
@ 2018-02-20 9:56 Max Gurtovoy
2018-02-20 10:13 ` Johannes Thumshirn
2018-02-22 2:59 ` Ming Lei
0 siblings, 2 replies; 6+ messages in thread
From: Max Gurtovoy @ 2018-02-20 9:56 UTC (permalink / raw)
To: linux-block@vger.kernel.org, Jens Axboe, Ming Lei, linux-nvme
Hi all,
Is there a way to drain a blk-mq based request queue (similar to
blk_drain_queue for non-MQ)?
I am trying to fix the following situation:
Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
ports during traffic using fio and making sure the traffic never fails.
When the switch port goes down, the initiator driver starts an error
recovery process:
- blk_mq_quiesce_queue for each namespace request queue
- cancel all requests of the tagset using blk_mq_tagset_busy_iter
- destroy the QPs/RDMA connections and MR pools
- blk_mq_unquiesce_queue for each namespace request queue
- reconnect to the target (after creating RDMA resources again)
During the QP destruction, I see a warning that not all the memory
regions were returned to the mr_pool. For every request we get from the
block layer (well, almost every request) we get an MR from the MR pool.
So what I see is that, depending on the timing, some requests are
dispatched/completed after we blk_mq_unquiesce_queue and after we
destroy the QP and the MR pool. Probably these requests were inserted
during quiescing, and I want to flush/drain them before I destroy the QP.
Is there a way in the block layer to do this (we don't want to
destroy the tagset and the request_queue on each reconnection)?
I'm open to suggestions :)
Cheers,
Max.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: how can one drain MQ request queue ?
2018-02-20 9:56 how can one drain MQ request queue ? Max Gurtovoy
@ 2018-02-20 10:13 ` Johannes Thumshirn
2018-02-22 2:59 ` Ming Lei
1 sibling, 0 replies; 6+ messages in thread
From: Johannes Thumshirn @ 2018-02-20 10:13 UTC (permalink / raw)
To: Max Gurtovoy
Cc: linux-block@vger.kernel.org, Jens Axboe, Ming Lei, linux-nvme
On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> hi all,
> is there a way to drain a blk-mq based request queue (similar to
> blk_drain_queue for non MQ) ?
I _think_ you can do an echo run >
/sys/kernel/debug/block/nvmeXnX/hctxX/state and that should trigger
it.
Hope that helps.
Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
* Re: how can one drain MQ request queue ?
2018-02-20 9:56 how can one drain MQ request queue ? Max Gurtovoy
2018-02-20 10:13 ` Johannes Thumshirn
@ 2018-02-22 2:59 ` Ming Lei
2018-02-22 10:56 ` Max Gurtovoy
1 sibling, 1 reply; 6+ messages in thread
From: Ming Lei @ 2018-02-22 2:59 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: linux-block@vger.kernel.org, Jens Axboe, linux-nvme
Hi Max,
On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> hi all,
> is there a way to drain a blk-mq based request queue (similar to
> blk_drain_queue for non MQ) ?
Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
based request queue, but it may not work well when the hardware is broken.
>
> I try to fix the following situation:
> Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
> ports during traffic using fio and making sure the traffic never fails.
>
> when the switch port goes down the initiator driver start an error recovery
What is the code you are referring to?
> process
> - blk_mq_quiesce_queue for each namespace request queue
blk_mq_quiesce_queue() only guarantees that no requests can be dispatched
to the low-level driver; new requests can still be allocated, but they
can't be dispatched until the queue becomes unquiesced.
> - cancel all requests of the tagset using blk_mq_tagset_busy_iter
Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
requests; the exact behaviour depends on the implementation of the
busy_tag_iter_fn, and timed-out requests can't be covered by
blk_mq_tagset_busy_iter().
So blk_mq_tagset_busy_iter() is often used in the error recovery path,
such as nvme_dev_disable(), which is used when resetting a PCIe NVMe
controller.
> - destroy the QPs/RDMA connections and MR pools
> - blk_mq_unquiesce_queue for each namespace request queue
> - reconnect to the target (after creating RDMA resources again)
>
> During the QP destruction, I see a warning that not all the memory regions
> were back to the mr_pool. For every request we get from the block layer
> (well, almost every request) we get a MR from the MR pool.
> So what I see is that, depends on the timing, some requests are
> dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
> the QP and the MR pool. Probably these request were inserted during
> quiescing,
Yes.
> and I want to flush/drain them before I destroy the QP.
As mentioned above, you can't do that by blk_mq_quiesce_queue() &
blk_mq_tagset_busy_iter().
The PCIe NVMe driver takes two steps for error recovery:
nvme_dev_disable() and nvme_reset_work(). You may consider a similar
approach, but the in-flight requests won't be drained in this case,
because they can be requeued.
Could you explain a bit what your exact problem is?
Thanks,
Ming
* Re: how can one drain MQ request queue ?
2018-02-22 2:59 ` Ming Lei
@ 2018-02-22 10:56 ` Max Gurtovoy
2018-02-22 13:10 ` Ming Lei
0 siblings, 1 reply; 6+ messages in thread
From: Max Gurtovoy @ 2018-02-22 10:56 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block@vger.kernel.org, Jens Axboe, linux-nvme
On 2/22/2018 4:59 AM, Ming Lei wrote:
> Hi Max,
Hi Ming,
>
> On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
>> hi all,
>> is there a way to drain a blk-mq based request queue (similar to
>> blk_drain_queue for non MQ) ?
>
> Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
> based request queue, but it may not work well when the hardware is broken.
I tried that, but then path failover takes ~cmd_timeout seconds, and
that is not good enough...
>
>>
>> I try to fix the following situation:
>> Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
>> ports during traffic using fio and making sure the traffic never fails.
>>
>> when the switch port goes down the initiator driver start an error recovery
>
> What is the code you are referring to?
from nvme_rdma driver:
static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
	struct nvme_rdma_ctrl *ctrl = container_of(work,
			struct nvme_rdma_ctrl, err_work);

	nvme_stop_keep_alive(&ctrl->ctrl);

	if (ctrl->ctrl.queue_count > 1) {
		nvme_stop_queues(&ctrl->ctrl);
		blk_mq_tagset_busy_iter(&ctrl->tag_set,
				nvme_cancel_request, &ctrl->ctrl);
		nvme_rdma_destroy_io_queues(ctrl, false);
	}

	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
			nvme_cancel_request, &ctrl->ctrl);
	nvme_rdma_destroy_admin_queue(ctrl, false);

	/*
	 * queues are not alive anymore, so restart the queues to fail fast
	 * new IO
	 */
	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
	nvme_start_queues(&ctrl->ctrl);

	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
		/* state change failure should never happen */
		WARN_ON_ONCE(1);
		return;
	}

	nvme_rdma_reconnect_or_remove(ctrl);
}
>
>> process
>> - blk_mq_quiesce_queue for each namespace request queue
>
> blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
> low level driver, and new requests still can be allocated, but can't be
> dispatched until the queue becomes unquiesced.
>
>> - cancel all requests of the tagset using blk_mq_tagset_busy_iter
>
> Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
> requests, and it depends on implementation of the busy_tag_iter_fn, and
> timed-out request can't be covered by blk_mq_tagset_busy_iter().
How can we deal with timed-out commands?
>
> So blk_mq_tagset_busy_iter() is often used in error recovery path, such
> as nvme_dev_disable(), which is usually used in resetting PCIe NVMe controller.
>
>> - destroy the QPs/RDMA connections and MR pools
>> - blk_mq_unquiesce_queue for each namespace request queue
>> - reconnect to the target (after creating RDMA resources again)
>>
>> During the QP destruction, I see a warning that not all the memory regions
>> were back to the mr_pool. For every request we get from the block layer
>> (well, almost every request) we get a MR from the MR pool.
>> So what I see is that, depends on the timing, some requests are
>> dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
>> the QP and the MR pool. Probably these request were inserted during
>> quiescing,
>
> Yes.
Maybe we need to update nvmf_check_init_req() to check that the ctrl is
in the NVME_CTRL_LIVE state (otherwise return IOERR), but I need to
think about it and test it.
>
>> and I want to flush/drain them before I destroy the QP.
>
> As mentioned above, you can't do that by blk_mq_quiesce_queue() &
> blk_mq_tagset_busy_iter().
>
> The PCIe NVMe driver takes two steps for the error recovery: nvme_dev_disable() &
> nvme_reset_work(), and you may consider the similar approach, but the in-flight
> requests won't be drained in this case because they can be requeued.
>
> Could you explain a bit what your exact problem is?
The problem is that I assign an MR from the QP's mr_pool for each call
to nvme_rdma_queue_rq. During error recovery I destroy the QP and the
mr_pool, *but* some MRs are missing and never returned to the pool.
>
> Thanks,
> Ming
>
Thanks,
Max.
* Re: how can one drain MQ request queue ?
2018-02-22 10:56 ` Max Gurtovoy
@ 2018-02-22 13:10 ` Ming Lei
2018-02-22 13:39 ` Ming Lei
0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2018-02-22 13:10 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: linux-block@vger.kernel.org, Jens Axboe, linux-nvme
On Thu, Feb 22, 2018 at 12:56:05PM +0200, Max Gurtovoy wrote:
>
>
> On 2/22/2018 4:59 AM, Ming Lei wrote:
> > Hi Max,
>
> Hi Ming,
>
> >
> > On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> > > hi all,
> > > is there a way to drain a blk-mq based request queue (similar to
> > > blk_drain_queue for non MQ) ?
> >
> > Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
> > based request queue, but it may not work well when the hardware is broken.
>
> I tried that, but the path failover takes ~cmd_timeout seconds and this is
> not good enough...
Yeah, I agree it isn't good for handling timeouts.
>
> >
> > >
> > > I try to fix the following situation:
> > > Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
> > > ports during traffic using fio and making sure the traffic never fails.
> > >
> > > when the switch port goes down the initiator driver start an error recovery
> >
> > What is the code you are referring to?
>
> from nvme_rdma driver:
>
> static void nvme_rdma_error_recovery_work(struct work_struct *work)
> {
> 	struct nvme_rdma_ctrl *ctrl = container_of(work,
> 			struct nvme_rdma_ctrl, err_work);
>
> 	nvme_stop_keep_alive(&ctrl->ctrl);
>
> 	if (ctrl->ctrl.queue_count > 1) {
> 		nvme_stop_queues(&ctrl->ctrl);
> 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> 				nvme_cancel_request, &ctrl->ctrl);
> 		nvme_rdma_destroy_io_queues(ctrl, false);
> 	}
>
> 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> 			nvme_cancel_request, &ctrl->ctrl);
> 	nvme_rdma_destroy_admin_queue(ctrl, false);
I am not sure it is a good idea to destroy the admin queue here, since
nvme_rdma_configure_admin_queue() needs to use the admin queue, and I
saw a report of 'nvme nvme0: Identify namespace failed' in a Red Hat
BZ.
>
> 	/*
> 	 * queues are not alive anymore, so restart the queues to fail fast
> 	 * new IO
> 	 */
> 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> 	nvme_start_queues(&ctrl->ctrl);
>
> 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> 		/* state change failure should never happen */
> 		WARN_ON_ONCE(1);
> 		return;
> 	}
>
> 	nvme_rdma_reconnect_or_remove(ctrl);
> }
>
>
> >
> > > process
> > > - blk_mq_quiesce_queue for each namespace request queue
> >
> > blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
> > low level driver, and new requests still can be allocated, but can't be
> > dispatched until the queue becomes unquiesced.
> >
> > > - cancel all requests of the tagset using blk_mq_tagset_busy_iter
> >
> > Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
> > requests, and it depends on implementation of the busy_tag_iter_fn, and
> > timed-out request can't be covered by blk_mq_tagset_busy_iter().
>
> How can we deal with timed-out commands ?
For PCIe NVMe, they are handled by requeuing, just like all the canceled
in-flight commands, and all of these commands will be dispatched to the
driver again after the reset completes successfully.
>
>
> >
> > So blk_mq_tagset_busy_iter() is often used in error recovery path, such
> > as nvme_dev_disable(), which is usually used in resetting PCIe NVMe controller.
> >
> > > - destroy the QPs/RDMA connections and MR pools
> > > - blk_mq_unquiesce_queue for each namespace request queue
> > > - reconnect to the target (after creating RDMA resources again)
> > >
> > > During the QP destruction, I see a warning that not all the memory regions
> > > were back to the mr_pool. For every request we get from the block layer
> > > (well, almost every request) we get a MR from the MR pool.
> > > So what I see is that, depends on the timing, some requests are
> > > dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
> > > the QP and the MR pool. Probably these request were inserted during
> > > quiescing,
> >
> > Yes.
>
> maybe we need to update the nvmf_check_init_req to check that the ctrl is in
> NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think about it
> and test it.
>
> >
> > > and I want to flush/drain them before I destroy the QP.
> >
> > As mentioned above, you can't do that by blk_mq_quiesce_queue() &
> > blk_mq_tagset_busy_iter().
> >
> > The PCIe NVMe driver takes two steps for the error recovery: nvme_dev_disable() &
> > nvme_reset_work(), and you may consider the similar approach, but the in-flight
> > requests won't be drained in this case because they can be requeued.
> >
> > Could you explain a bit what your exact problem is?
>
> The problem is that I assign an MR from QP mr_pool for each call to
> nvme_rdma_queue_rq. During the error recovery I destroy the QP and the
> mr_pool *but* some MR's are missing and not returned to the pool.
OK, it looks like you think all in-flight requests can be completed
during error recovery. That shouldn't be correct, since all in-flight
requests have to be retried after error recovery is done to avoid data
loss.
So it seems the mr_pool shouldn't be destroyed, I guess.
Thanks,
Ming
* Re: how can one drain MQ request queue ?
2018-02-22 13:10 ` Ming Lei
@ 2018-02-22 13:39 ` Ming Lei
0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2018-02-22 13:39 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: linux-block@vger.kernel.org, Jens Axboe, linux-nvme
On Thu, Feb 22, 2018 at 09:10:26PM +0800, Ming Lei wrote:
> On Thu, Feb 22, 2018 at 12:56:05PM +0200, Max Gurtovoy wrote:
> >
> >
> > On 2/22/2018 4:59 AM, Ming Lei wrote:
> > > Hi Max,
> >
> > Hi Ming,
> >
> > >
> > > On Tue, Feb 20, 2018 at 11:56:07AM +0200, Max Gurtovoy wrote:
> > > > hi all,
> > > > is there a way to drain a blk-mq based request queue (similar to
> > > > blk_drain_queue for non MQ) ?
> > >
> > > Generally speaking, blk_mq_freeze_queue() should be fine to drain blk-mq
> > > based request queue, but it may not work well when the hardware is broken.
> >
> > I tried that, but the path failover takes ~cmd_timeout seconds and this is
> > not good enough...
>
> Yeah, I agree it isn't good for handling timeout.
>
> >
> > >
> > > >
> > > > I try to fix the following situation:
> > > > Running DM-multipath over NVMEoF/RDMA block devices, toggling the switch
> > > > ports during traffic using fio and making sure the traffic never fails.
> > > >
> > > > when the switch port goes down the initiator driver start an error recovery
> > >
> > > What is the code you are referring to?
> >
> > from nvme_rdma driver:
> >
> > static void nvme_rdma_error_recovery_work(struct work_struct *work)
> > {
> > 	struct nvme_rdma_ctrl *ctrl = container_of(work,
> > 			struct nvme_rdma_ctrl, err_work);
> >
> > 	nvme_stop_keep_alive(&ctrl->ctrl);
> >
> > 	if (ctrl->ctrl.queue_count > 1) {
> > 		nvme_stop_queues(&ctrl->ctrl);
> > 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> > 				nvme_cancel_request, &ctrl->ctrl);
> > 		nvme_rdma_destroy_io_queues(ctrl, false);
> > 	}
> >
> > 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> > 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> > 			nvme_cancel_request, &ctrl->ctrl);
> > 	nvme_rdma_destroy_admin_queue(ctrl, false);
>
> I am not sure if it is good to destroy admin queue here since
> nvme_rdma_configure_admin_queue() need to use admin queue, and I saw
> there is report of 'nvme nvme0: Identify namespace failed' in Red Hat
> BZ.
>
> >
> > 	/*
> > 	 * queues are not alive anymore, so restart the queues to fail fast
> > 	 * new IO
> > 	 */
> > 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> > 	nvme_start_queues(&ctrl->ctrl);
> >
> > 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> > 		/* state change failure should never happen */
> > 		WARN_ON_ONCE(1);
> > 		return;
> > 	}
> >
> > 	nvme_rdma_reconnect_or_remove(ctrl);
> > }
> >
> >
> > >
> > > > process
> > > > - blk_mq_quiesce_queue for each namespace request queue
> > >
> > > blk_mq_quiesce_queue() only guarantees that no requests can be dispatched to
> > > low level driver, and new requests still can be allocated, but can't be
> > > dispatched until the queue becomes unquiesced.
> > >
> > > > - cancel all requests of the tagset using blk_mq_tagset_busy_iter
> > >
> > > Generally blk_mq_tagset_busy_iter() is used to cancel all in-flight
> > > requests, and it depends on implementation of the busy_tag_iter_fn, and
> > > timed-out request can't be covered by blk_mq_tagset_busy_iter().
> >
> > How can we deal with timed-out commands ?
>
> For PCI NVMe, they are handled by requeuing, just like all canceled
> in-flight commands, and all these commands will be dispatched to driver
> again after reset is done successfully.
>
> >
> >
> > >
> > > So blk_mq_tagset_busy_iter() is often used in error recovery path, such
> > > as nvme_dev_disable(), which is usually used in resetting PCIe NVMe controller.
> > >
> > > > - destroy the QPs/RDMA connections and MR pools
> > > > - blk_mq_unquiesce_queue for each namespace request queue
> > > > - reconnect to the target (after creating RDMA resources again)
> > > >
> > > > During the QP destruction, I see a warning that not all the memory regions
> > > > were back to the mr_pool. For every request we get from the block layer
> > > > (well, almost every request) we get a MR from the MR pool.
> > > > So what I see is that, depends on the timing, some requests are
> > > > dispatched/completed after we blk_mq_unquiesce_queue and after we destroy
> > > > the QP and the MR pool. Probably these request were inserted during
> > > > quiescing,
> > >
> > > Yes.
> >
> > maybe we need to update the nvmf_check_init_req to check that the ctrl is in
> > NVME_CTRL_LIVE state (otherwise return IOERR), but I need to think about it
> > and test it.
> >
> > >
> > > > and I want to flush/drain them before I destroy the QP.
> > >
> > > As mentioned above, you can't do that by blk_mq_quiesce_queue() &
> > > blk_mq_tagset_busy_iter().
> > >
> > > The PCIe NVMe driver takes two steps for the error recovery: nvme_dev_disable() &
> > > nvme_reset_work(), and you may consider the similar approach, but the in-flight
> > > requests won't be drained in this case because they can be requeued.
> > >
> > > Could you explain a bit what your exact problem is?
> >
> > The problem is that I assign an MR from QP mr_pool for each call to
> > nvme_rdma_queue_rq. During the error recovery I destroy the QP and the
> > mr_pool *but* some MR's are missing and not returned to the pool.
>
> OK, looks you think all in-flight requests can be completed during error
> recovery. That shouldn't be correct since all in-flight requests have to
> be retried after error recovery is done for avoiding data loss.
It looks like there is one issue w.r.t. timed-out requests:
nvme_rdma_destroy_io_queues() may be called before the timed-out
request is completed.
And that is very likely, since the timed-out request is only completed
by __blk_mq_complete_request() in blk_mq_rq_timed_out() after
nvme_rdma_timeout() returns.
We discussed a similar issue for PCIe NVMe; it seems RDMA needs to
synchronize between the error recovery path and the timeout handler
too.
https://www.spinics.net/lists/stable/msg211856.html
--
Ming