linux-nvme.lists.infradead.org archive mirror
* [PATCH 0/2] nvme-rdma: device removal crash fixes
@ 2016-07-13 21:26 Ming Lin
  2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw)


From: Ming Lin <ming.l@samsung.com>

Here are 2 fixes for the device removal crash.

Steve,

Could you review and add your tag if it looks OK?
Thanks again for your great debugging.

Ming Lin (2):
  nvme-rdma: grab reference for device removal event
  nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl

 drivers/nvme/host/rdma.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

-- 
1.9.1


* [PATCH 1/2] nvme-rdma: grab reference for device removal event
  2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
@ 2016-07-13 21:26 ` Ming Lin
  2016-07-13 21:33   ` Steve Wise
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
  2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise
  2 siblings, 1 reply; 26+ messages in thread
From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw)


From: Ming Lin <ming.l@samsung.com>

The crash below was triggered when shutting down, via 'reboot', an NVMe
host node that has one target device attached.

[   88.897220] BUG: unable to handle kernel paging request at ffffebe00400f820
[   88.905226] IP: [<ffffffff811e8d76>] kfree+0x56/0x170
[   89.182264] Call Trace:
[   89.185899]  [<ffffffffc09f7052>] nvme_rdma_free_ring.constprop.42+0x42/0xb0 [nvme_rdma]
[   89.195193]  [<ffffffffc09f77ba>] nvme_rdma_destroy_queue_ib+0x3a/0x60 [nvme_rdma]
[   89.203969]  [<ffffffffc09f92bc>] nvme_rdma_cm_handler+0x69c/0x8b6 [nvme_rdma]
[   89.212406]  [<ffffffff811e859b>] ? __slab_free+0x9b/0x2b0
[   89.219101]  [<ffffffffc0a2c694>] cma_remove_one+0x1f4/0x220 [rdma_cm]
[   89.226838]  [<ffffffffc09415b3>] ib_unregister_device+0xc3/0x160 [ib_core]
[   89.235008]  [<ffffffffc0a0798a>] mlx4_ib_remove+0x6a/0x220 [mlx4_ib]
[   89.242656]  [<ffffffffc097ede7>] mlx4_remove_device+0x97/0xb0 [mlx4_core]
[   89.250732]  [<ffffffffc097f48e>] mlx4_unregister_device+0x3e/0xa0 [mlx4_core]
[   89.259151]  [<ffffffffc0983a46>] mlx4_unload_one+0x86/0x2f0 [mlx4_core]
[   89.267049]  [<ffffffffc0983d97>] mlx4_shutdown+0x57/0x70 [mlx4_core]
[   89.274680]  [<ffffffff8141c4b6>] pci_device_shutdown+0x36/0x70
[   89.281792]  [<ffffffff81526c13>] device_shutdown+0xd3/0x180
[   89.288638]  [<ffffffff8109e556>] kernel_restart_prepare+0x36/0x40
[   89.296003]  [<ffffffff8109e602>] kernel_restart+0x12/0x60
[   89.302688]  [<ffffffff8109e983>] SYSC_reboot+0x1f3/0x220
[   89.309266]  [<ffffffff81186048>] ? __filemap_fdatawait_range+0xa8/0x150
[   89.317151]  [<ffffffff8123ec20>] ? fdatawait_one_bdev+0x20/0x20
[   89.324335]  [<ffffffff81188585>] ? __filemap_fdatawrite_range+0xb5/0xf0
[   89.332227]  [<ffffffff8122880a>] ? iput+0x8a/0x200
[   89.338294]  [<ffffffff8123ec00>] ? sync_inodes_one_sb+0x20/0x20
[   89.345465]  [<ffffffff812480d7>] ? iterate_bdevs+0x117/0x130
[   89.352367]  [<ffffffff8109ea0e>] SyS_reboot+0xe/0x10

Debug shows:

[31948.771952] MYDEBUG: init kref: nvme_init_ctrl
[31948.798589] MYDEBUG: get: nvme_rdma_create_ctrl
[31948.803765] MYDEBUG: put: nvmf_dev_release
[31948.808734] MYDEBUG: get: nvme_alloc_ns
[31948.884775] MYDEBUG: put: nvme_free_ns
[31948.890155] MYDEBUG in nvme_rdma_destroy_queue_ib: queue ffff8800cdc81470: io queue
[31948.900539] MYDEBUG: put: nvme_rdma_del_ctrl_work
[31948.909469] MYDEBUG: nvme_rdma_free_ctrl called
[31948.915379] MYDEBUG in nvme_rdma_destroy_queue_ib: queue ffff8800cdc81400: admin queue

So nvme_rdma_destroy_queue_ib() was called for the admin queue
after the ctrl had already been freed.
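
The MYDEBUG lines above appear to come from ad-hoc printk() instrumentation
around the ctrl kref; a minimal sketch of that kind of tracing (wrapper names
and placement are assumptions, not the actual debug patch):

#include <linux/kref.h>
#include <linux/printk.h>

/* Drop-in wrappers at each get/put call site produce a trace like the
 * one above; "who" is the calling function's name. */
static inline void dbg_kref_get(struct kref *ref, const char *who)
{
	pr_info("MYDEBUG: get: %s\n", who);
	kref_get(ref);
}

static inline void dbg_kref_put(struct kref *ref,
				void (*release)(struct kref *),
				const char *who)
{
	pr_info("MYDEBUG: put: %s\n", who);
	kref_put(ref, release);
}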

Fix it by taking a ctrl reference in nvme_rdma_device_unplug() and
putting it back after the queue is freed, so the ctrl won't be freed
before we free the last queue.

Reported-by: Steve Wise <swise at opengridcomputing.com>
Signed-off-by: Ming Lin <ming.l at samsung.com>
---
 drivers/nvme/host/rdma.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 3e3ce2b..f845304 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1331,6 +1331,12 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
 	if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
 		goto out;
 
+	/*
+	 * Grab a reference so the ctrl won't be freed before we free
+	 * the last queue
+	 */
+	kref_get(&ctrl->ctrl.kref);
+
 	/* delete the controller */
 	ret = __nvme_rdma_del_ctrl(ctrl);
 	if (!ret) {
@@ -1347,6 +1353,8 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
 		nvme_rdma_destroy_queue_ib(queue);
 	}
 
+	nvme_put_ctrl(&ctrl->ctrl);
+
 out:
 	return ctrl_deleted;
 }
-- 
1.9.1


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
  2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
@ 2016-07-13 21:26 ` Ming Lin
  2016-07-13 21:33   ` Steve Wise
                     ` (2 more replies)
  2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise
  2 siblings, 3 replies; 26+ messages in thread
From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw)


From: Ming Lin <ming.l@samsung.com>

Steve reported the crash below when unloading iw_cxgb4.

[59079.932154] nvme nvme1: Got rdma device removal event, deleting ctrl
[59080.034208] BUG: unable to handle kernel paging request at ffff880f4e6c01f8
[59080.041972] IP: [<ffffffffa02e5a46>] __ib_process_cq+0x46/0xc0 [ib_core]
[59080.049422] PGD 22a5067 PUD 10788d8067 PMD 1078864067 PTE 8000000f4e6c0060
[59080.057109] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC

[59080.164160] CPU: 0 PID: 14879 Comm: kworker/u64:2 Tainted: G            E   4.7.0-rc2-block-for-next+ #78
[59080.174704] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[59080.182673] Workqueue: iw_cxgb4 process_work [iw_cxgb4]
[59080.188924] task: ffff8810278646c0 ti: ffff880ff271c000 task.ti: ffff880ff271c000
[59080.197448] RIP: 0010:[<ffffffffa02e5a46>]  [<ffffffffa02e5a46>] __ib_process_cq+0x46/0xc0 [ib_core]
[59080.207647] RSP: 0018:ffff881036e03e48  EFLAGS: 00010282
[59080.214000] RAX: 0000000000000010 RBX: ffff8810203f3508 RCX: 0000000000000000
[59080.222194] RDX: ffff880f4e6c01f8 RSI: ffff880f4e6a1fe8 RDI: ffff8810203f3508
[59080.230393] RBP: ffff881036e03e88 R08: 0000000000000000 R09: 000000000000000c
[59080.238598] R10: 0000000000000000 R11: 00000000000001f8 R12: 0000000000000020
[59080.246800] R13: 0000000000000100 R14: 0000000000000000 R15: 0000000000000000
[59080.255002] FS:  0000000000000000(0000) GS:ffff881036e00000(0000) knlGS:0000000000000000
[59080.264173] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[59080.271013] CR2: ffff880f4e6c01f8 CR3: 000000102105f000 CR4: 00000000000406f0
[59080.279258] Stack:
[59080.282377]  0000000000000000 00000010fcddc1f8 0000000000000246 ffff8810203f3548
[59080.290979]  ffff881036e13630 0000000000000100 ffff8810203f3508 ffff881036e03ed8
[59080.299587]  ffff881036e03eb8 ffffffffa02e5e12 ffff8810203f3548 ffff881036e13630
[59080.308198] Call Trace:
[59080.311779]  <IRQ>
[59080.313731]  [<ffffffffa02e5e12>] ib_poll_handler+0x32/0x80 [ib_core]
[59080.322653]  [<ffffffff81395695>] irq_poll_softirq+0xa5/0xf0
[59080.329484]  [<ffffffff816f186a>] __do_softirq+0xda/0x304
[59080.336047]  [<ffffffff816f15b5>] ? do_IRQ+0x65/0xf0
[59080.342193]  [<ffffffff816f08fc>] do_softirq_own_stack+0x1c/0x30
[59080.349381]  <EOI>
[59080.351351]  [<ffffffff8109004e>] do_softirq+0x4e/0x50
[59080.359018]  [<ffffffff81090127>] __local_bh_enable_ip+0x87/0x90
[59080.366178]  [<ffffffffa081b837>] t4_ofld_send+0x127/0x180 [cxgb4]
[59080.373499]  [<ffffffffa08095ae>] cxgb4_remove_tid+0x9e/0x140 [cxgb4]
[59080.381079]  [<ffffffffa039235c>] _c4iw_free_ep+0x5c/0x100 [iw_cxgb4]
[59080.388665]  [<ffffffffa0396812>] peer_close+0x102/0x260 [iw_cxgb4]
[59080.396082]  [<ffffffffa039629a>] ? process_work+0x5a/0x70 [iw_cxgb4]
[59080.403664]  [<ffffffffa039629a>] ? process_work+0x5a/0x70 [iw_cxgb4]
[59080.411254]  [<ffffffff815c42c4>] ? __kfree_skb+0x34/0x80
[59080.417762]  [<ffffffff815c4437>] ? kfree_skb+0x47/0xb0
[59080.424084]  [<ffffffff815c24e7>] ? skb_dequeue+0x67/0x80
[59080.430569]  [<ffffffffa039628e>] process_work+0x4e/0x70 [iw_cxgb4]
[59080.437940]  [<ffffffff810a4d03>] process_one_work+0x183/0x4d0
[59080.444862]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[59080.451373]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[59080.457506]  [<ffffffff810a59bd>] worker_thread+0x16d/0x530
[59080.464056]  [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0
[59080.470570]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[59080.476985]  [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90
[59080.483696]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[59080.490824]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[59080.496808]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[59080.503892]  [<ffffffff810aa5dc>] kthread+0xcc/0xf0
[59080.509573]  [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0
[59080.515928]  [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40
[59080.522093]  [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70

Quoting Steve's analysis:

"I think the problem is nvme_destroy_admin_queue() is called as part of
destroying the controller: nvme_rdma_del_ctrl_work() calls
nvme_rdma_shutdown_ctrl() which calls nvme_rdma_destroy_admin_queue().  The
admin nvme_rdma_queue doesn't get destroyed/freed, though, because the
NVME_RDMA_Q_CONNECTED flag has already been cleared by
nvme_rdma_device_unplug().  However nvme_destroy_admin_queue() _does_ free the
tag set memory, which I believe contains the nvme_rdma_request structs that
contain the ib_cqe struct, so when nvme_rdma_device_unplug() does finally flush
the QP we crash..."
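
As a rough illustration of the memory ownership Steve describes (an
illustrative sketch with made-up names, not the driver's actual
definitions): the tag set owns the per-request driver data, so freeing
the tag set also frees the ib_cqe that a late completion or QP flush
would still reference.

#include <linux/blk-mq.h>
#include <rdma/ib_verbs.h>

/* Sketch only: the real nvme_rdma_request carries more state. */
struct example_rdma_request {
	struct ib_cqe	cqe;	/* handed to the RDMA core for completions */
	/* ... other per-request fields ... */
};

static int example_alloc_tag_set(struct blk_mq_tag_set *set)
{
	/*
	 * cmd_size makes blk_mq_alloc_tag_set() allocate one
	 * example_rdma_request per tag; blk_mq_free_tag_set() frees that
	 * memory, so nothing may touch the embedded ib_cqe (e.g. a flushed
	 * completion after ib_drain_qp()) once the tag set is gone.
	 */
	set->cmd_size = sizeof(struct example_rdma_request);
	return blk_mq_alloc_tag_set(set);
}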

Move the admin queue cleanup to nvme_rdma_free_ctrl(), where we can be
sure that the RDMA queue has already been disconnected and drained and
that no code will access it any more.

Reported-by: Steve Wise <swise at opengridcomputing.com>
Signed-off-by: Ming Lin <ming.l at samsung.com>
---
 drivers/nvme/host/rdma.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f845304..0d3c227 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -671,9 +671,6 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl)
 	nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
-	blk_cleanup_queue(ctrl->ctrl.admin_q);
-	blk_mq_free_tag_set(&ctrl->admin_tag_set);
-	nvme_rdma_dev_put(ctrl->device);
 }
 
 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
@@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
 	list_del(&ctrl->list);
 	mutex_unlock(&nvme_rdma_ctrl_mutex);
 
+	blk_cleanup_queue(ctrl->ctrl.admin_q);
+	blk_mq_free_tag_set(&ctrl->admin_tag_set);
+	nvme_rdma_dev_put(ctrl->device);
+
 	if (ctrl->ctrl.tagset) {
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
 		blk_mq_free_tag_set(&ctrl->tag_set);
-- 
1.9.1


* [PATCH 1/2] nvme-rdma: grab reference for device removal event
  2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
@ 2016-07-13 21:33   ` Steve Wise
  0 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-13 21:33 UTC (permalink / raw)


Looks good.

Reviewed-by: Steve Wise <swise at opengridcomputing.com>


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
@ 2016-07-13 21:33   ` Steve Wise
  2016-07-13 23:19   ` J Freyensee
  2016-07-14  9:15   ` Sagi Grimberg
  2 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-13 21:33 UTC (permalink / raw)


Looks good.

Reviewed-by: Steve Wise <swise at opengridcomputing.com>


* [PATCH 0/2] nvme-rdma: device removal crash fixes
  2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
  2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
@ 2016-07-13 21:58 ` Steve Wise
  2 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-13 21:58 UTC (permalink / raw)


> From: Ming Lin <ming.l at samsung.com>
> 
> Here are 2 fixes for the device removal crash.
> 
> Steve,
> 
> Could you review and add your tag if it looks OK?
> Thanks again for your great debugging.

Thanks for the quick fixes.

The series tests good.  I verified no crashing on module unload/shutdown using
both cxgb4 and mlx4.

Tested-by: Steve Wise <swise at opengridcomputing.com>


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
  2016-07-13 21:33   ` Steve Wise
@ 2016-07-13 23:19   ` J Freyensee
  2016-07-13 23:36     ` Ming Lin
  2016-07-14  9:15   ` Sagi Grimberg
  2 siblings, 1 reply; 26+ messages in thread
From: J Freyensee @ 2016-07-13 23:19 UTC (permalink / raw)



>  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl
> *nctrl)
>  	list_del(&ctrl->list);
>  	mutex_unlock(&nvme_rdma_ctrl_mutex);
>  
> +	blk_cleanup_queue(ctrl->ctrl.admin_q);
> +	blk_mq_free_tag_set(&ctrl->admin_tag_set);
> +	nvme_rdma_dev_put(ctrl->device);
> +
>  	if (ctrl->ctrl.tagset) {
>  		blk_cleanup_queue(ctrl->ctrl.connect_q);
>  		blk_mq_free_tag_set(&ctrl->tag_set);

This patch does not remove the second

nvme_rdma_dev_put(ctrl->device);

call that happens within the if() statement above if it evaluates to
TRUE.  Should that have been removed or moved elsewhere?


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 23:19   ` J Freyensee
@ 2016-07-13 23:36     ` Ming Lin
  2016-07-13 23:59       ` J Freyensee
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lin @ 2016-07-13 23:36 UTC (permalink / raw)


On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
>
>>  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
>> @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl
>> *nctrl)
>>       list_del(&ctrl->list);
>>       mutex_unlock(&nvme_rdma_ctrl_mutex);
>>
>> +     blk_cleanup_queue(ctrl->ctrl.admin_q);
>> +     blk_mq_free_tag_set(&ctrl->admin_tag_set);
>> +     nvme_rdma_dev_put(ctrl->device);
>> +
>>       if (ctrl->ctrl.tagset) {
>>               blk_cleanup_queue(ctrl->ctrl.connect_q);
>>               blk_mq_free_tag_set(&ctrl->tag_set);
>
> This patch does not remove the second
>
> nvme_rdma_dev_put(ctrl->device);
>
> call that happens within the if() statement above if it evaluates to
> TRUE.  Should that have been removed or moved elsewhere?

Not sure I understand your question.
Did you mean line 694?

For discovery controller, there is no IO queues. So ctrl->ctrl.tagset is NULL.

The first bulk of
"blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for admin
queue.
And the second is for IO queues.

 676 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
 677 {
 678         struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
 679
 680         if (list_empty(&ctrl->list))
 681                 goto free_ctrl;
 682
 683         mutex_lock(&nvme_rdma_ctrl_mutex);
 684         list_del(&ctrl->list);
 685         mutex_unlock(&nvme_rdma_ctrl_mutex);
 686
 687         blk_cleanup_queue(ctrl->ctrl.admin_q);
 688         blk_mq_free_tag_set(&ctrl->admin_tag_set);
 689         nvme_rdma_dev_put(ctrl->device);
 690
 691         if (ctrl->ctrl.tagset) {
 692                 blk_cleanup_queue(ctrl->ctrl.connect_q);
 693                 blk_mq_free_tag_set(&ctrl->tag_set);
 694                 nvme_rdma_dev_put(ctrl->device);
 695         }
 696         kfree(ctrl->queues);
 697         nvmf_free_options(nctrl->opts);
 698 free_ctrl:
 699         kfree(ctrl);
 700 }


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 23:36     ` Ming Lin
@ 2016-07-13 23:59       ` J Freyensee
  2016-07-14  6:39         ` Ming Lin
  0 siblings, 1 reply; 26+ messages in thread
From: J Freyensee @ 2016-07-13 23:59 UTC (permalink / raw)


On Wed, 2016-07-13 at 16:36 -0700, Ming Lin wrote:
> On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee
> <james_p_freyensee@linux.intel.com> wrote:
> > 
> > >  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct
> > > nvme_ctrl
> > > *nctrl)
> > >       list_del(&ctrl->list);
> > >       mutex_unlock(&nvme_rdma_ctrl_mutex);
> > > 
> > > +     blk_cleanup_queue(ctrl->ctrl.admin_q);
> > > +     blk_mq_free_tag_set(&ctrl->admin_tag_set);
> > > +     nvme_rdma_dev_put(ctrl->device);
> > > +
> > >       if (ctrl->ctrl.tagset) {
> > >               blk_cleanup_queue(ctrl->ctrl.connect_q);
> > >               blk_mq_free_tag_set(&ctrl->tag_set);
> > 
> > This patch does not remove the second
> > 
> > nvme_rdma_dev_put(ctrl->device);
> > 
> > call that happens within the if() statement above if it evaluates
> > to
> > TRUE.  Should that have been removed or moved elsewhere?
> 
> Not sure I understand your question.
> Did you mean line 694?

Yes I mean line 694.

> 
> For discovery controller, there is no IO queues. So ctrl->ctrl.tagset
> is NULL.
> 
> The first bulk of
> "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for
> admin
> queue.
> And the second is for IO queues.

I'm just confused when nvme_free_ctrl() in core.c calls:

ctrl->ops->free_ctrl(ctrl);

which looks like would be the only call that would free both the admin
and I/O rdma queues, why there would be the potential to do a _put()
twice in nvme_rdma_free_ctrl() via:

nvme_rdma_dev_put(ctrl->device);

one for the admin section:

687         blk_cleanup_queue(ctrl>ctrl.admin_q);
688         blk_mq_free_tag_set(&ctrl->admin_tag_set);
689         nvme_rdma_dev_put(ctrl->device);

and one for the I/O section (assuming "if (ctrl->ctrl.tagset)" evaluate
s to TRUE in that call):

691         if (ctrl->ctrl.tagset) {
692                 blk_cleanup_queue(ctrl->ctrl.connect_q);
693                 blk_mq_free_tag_set(&ctrl->tag_set);
694                 nvme_rdma_dev_put(ctrl->device);
695         }


My assumption would be that the correct path for this case would be
such that 

nvme_rdma_dev_put(ctrl->device); 

would only be called one time, for a single device.


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 23:59       ` J Freyensee
@ 2016-07-14  6:39         ` Ming Lin
  2016-07-14 17:09           ` J Freyensee
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lin @ 2016-07-14  6:39 UTC (permalink / raw)


On Wed, 2016-07-13 at 16:59 -0700, J Freyensee wrote:
> On Wed, 2016-07-13 at 16:36 -0700, Ming Lin wrote:
> > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee
> > <james_p_freyensee@linux.intel.com> wrote:
> > > 
> > > >  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct
> > > > nvme_ctrl
> > > > *nctrl)
> > > >       list_del(&ctrl->list);
> > > >       mutex_unlock(&nvme_rdma_ctrl_mutex);
> > > > 
> > > > +     blk_cleanup_queue(ctrl->ctrl.admin_q);
> > > > +     blk_mq_free_tag_set(&ctrl->admin_tag_set);
> > > > +     nvme_rdma_dev_put(ctrl->device);
> > > > +
> > > >       if (ctrl->ctrl.tagset) {
> > > >               blk_cleanup_queue(ctrl->ctrl.connect_q);
> > > >               blk_mq_free_tag_set(&ctrl->tag_set);
> > > 
> > > This patch does not remove the second
> > > 
> > > nvme_rdma_dev_put(ctrl->device);
> > > 
> > > call that happens within the if() statement above if it evaluates
> > > to
> > > TRUE.  Should that have been removed or moved elsewhere?
> > 
> > Not sure I understand your question.
> > Did you mean line 694?
> 
> Yes I mean line 694.
> 
> > 
> > For discovery controller, there is no IO queues. So ctrl-
> > >ctrl.tagset
> > is NULL.
> > 
> > The first bulk of
> > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for
> > admin
> > queue.
> > And the second is for IO queues.
> 
> I'm just confused when nvme_free_ctrl() in core.c calls:
> 
> ctrl->ops->free_ctrl(ctrl);
> 
> which looks like would be the only call that would free both the
> admin
> and I/O rdma queues, why there would be the potential to do a _put()
> twice in nvme_rdma_free_ctrl() via:
> 
> nvme_rdma_dev_put(ctrl->device);
> 
> one for the admin section:
> 
> 687         blk_cleanup_queue(ctrl>ctrl.admin_q);
> 688         blk_mq_free_tag_set(&ctrl->admin_tag_set);
> 689         nvme_rdma_dev_put(ctrl->device);

This put paired with the get in nvme_rdma_configure_admin_queue()

> 
> and one for the I/O section (assuming "if (ctrl->ctrl.tagset)"
> evaluate
> s to TRUE in that call):
> 
> 691         if (ctrl->ctrl.tagset) {
> 692                 blk_cleanup_queue(ctrl->ctrl.connect_q);
> 693                 blk_mq_free_tag_set(&ctrl->tag_set);
> 694                 nvme_rdma_dev_put(ctrl->device);

This put paired with the get in nvme_rdma_create_io_queues()

> 695         }
> 
> 
> My assumption would be that the correct path for this case would be
> such that
> 
> nvme_rdma_dev_put(ctrl->device);
> 
> would only be called one time, for a single device.

As above, need put for each get.
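
For readers following the refcounting here: this is the standard kref
pairing rule, where each successful get must be balanced by exactly one
put, and the object is only freed when the last put fires the release
callback. A minimal, self-contained sketch (invented names, not
nvme-rdma code):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct example_dev {
	struct kref ref;
};

static void example_dev_release(struct kref *ref)
{
	kfree(container_of(ref, struct example_dev, ref));
}

/* Setup paths (admin queue, I/O queues) each take their own reference... */
static void example_queue_setup(struct example_dev *dev)
{
	kref_get(&dev->ref);
}

/* ...and each teardown path drops exactly the reference it took. */
static void example_queue_teardown(struct example_dev *dev)
{
	kref_put(&dev->ref, example_dev_release);
}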


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
  2016-07-13 21:33   ` Steve Wise
  2016-07-13 23:19   ` J Freyensee
@ 2016-07-14  9:15   ` Sagi Grimberg
  2016-07-14  9:17     ` Sagi Grimberg
                       ` (2 more replies)
  2 siblings, 3 replies; 26+ messages in thread
From: Sagi Grimberg @ 2016-07-14  9:15 UTC (permalink / raw)


Hey Ming, Steve,


> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index f845304..0d3c227 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -671,9 +671,6 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl)
>   	nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
>   			sizeof(struct nvme_command), DMA_TO_DEVICE);
>   	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
> -	blk_cleanup_queue(ctrl->ctrl.admin_q);
> -	blk_mq_free_tag_set(&ctrl->admin_tag_set);
> -	nvme_rdma_dev_put(ctrl->device);
>   }
>
>   static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
>   	list_del(&ctrl->list);
>   	mutex_unlock(&nvme_rdma_ctrl_mutex);
>
> +	blk_cleanup_queue(ctrl->ctrl.admin_q);
> +	blk_mq_free_tag_set(&ctrl->admin_tag_set);
> +	nvme_rdma_dev_put(ctrl->device);
> +
>   	if (ctrl->ctrl.tagset) {
>   		blk_cleanup_queue(ctrl->ctrl.connect_q);
>   		blk_mq_free_tag_set(&ctrl->tag_set);
>

This patch introduces asymmetry between create and destroy
of the admin queue. Does this alternative patch solve
the problem?

The patch changes the order of device removal flow from:
1. delete controller
2. destroy queue

to:
1. destroy queue
2. delete controller

Or more specifically:
1. own the controller deletion (make sure we are not
competing with anyone)
2. get rid of inflight reconnects (which also destroy and
create queues)
3. destroy the queue
4. safely queue controller deletion

Thoughts?

--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 21ecbf3f3603..36cb4e33175a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -169,7 +169,6 @@ MODULE_PARM_DESC(register_always,
  static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
                 struct rdma_cm_event *event);
  static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
-static int __nvme_rdma_del_ctrl(struct nvme_rdma_ctrl *ctrl);

  /* XXX: really should move to a generic header sooner or later.. */
  static inline void put_unaligned_le24(u32 val, u8 *p)
@@ -1325,30 +1324,36 @@ out_destroy_queue_ib:
  static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
  {
         struct nvme_rdma_ctrl *ctrl = queue->ctrl;
-       int ret, ctrl_deleted = 0;
+       int ret;

-       /* First disable the queue so ctrl delete won't free it */
-       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
-               goto out;
+       /* Own the controller deletion */
+       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING))
+               return 0;

-       /* delete the controller */
-       ret = __nvme_rdma_del_ctrl(ctrl);
-       if (!ret) {
-               dev_warn(ctrl->ctrl.device,
-                       "Got rdma device removal event, deleting ctrl\n");
-               flush_work(&ctrl->delete_work);
+       dev_warn(ctrl->ctrl.device,
+               "Got rdma device removal event, deleting ctrl\n");

-               /* Return non-zero so the cm_id will destroy implicitly */
-               ctrl_deleted = 1;
+       /* Get rid of reconnect work if its running */
+       cancel_delayed_work_sync(&ctrl->reconnect_work);

-               /* Free this queue ourselves */
-               rdma_disconnect(queue->cm_id);
-               ib_drain_qp(queue->qp);
-               nvme_rdma_destroy_queue_ib(queue);
+       /* Disable the queue so ctrl delete won't free it */
+       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) {
+               ret = 0;
+               goto queue_delete;
         }

-out:
-       return ctrl_deleted;
+       /* Free this queue ourselves */
+       nvme_rdma_stop_queue(queue);
+       nvme_rdma_destroy_queue_ib(queue);
+
+       /* Return non-zero so the cm_id will destroy implicitly */
+       ret = 1;
+
+queue_delete:
+       /* queue controller deletion */
+       queue_work(nvme_rdma_wq, &ctrl->delete_work);
+       flush_work(&ctrl->delete_work);
+       return ret;
  }

  static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
--


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14  9:15   ` Sagi Grimberg
@ 2016-07-14  9:17     ` Sagi Grimberg
  2016-07-14 14:30       ` Steve Wise
  2016-07-14 14:59     ` Steve Wise
       [not found]     ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
  2 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2016-07-14  9:17 UTC (permalink / raw)



> This patch introduces asymmetry between create and destroy
> of the admin queue. Does this alternative patch solve
> the problem?
>
> The patch changes the order of device removal flow from:
> 1. delete controller
> 2. destroy queue
>
> to:
> 1. destroy queue
> 2. delete controller
>
> Or more specifically:
> 1. own the controller deletion (make sure we are not
> competing with anyone)
> 2. get rid of inflight reconnects (which also destroy and
> create queues)
> 3. destroy the queue
> 4. safely queue controller deletion
>
> Thoughts?

Should mention that patch 1 is not needed as well with this...


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14  9:17     ` Sagi Grimberg
@ 2016-07-14 14:30       ` Steve Wise
  2016-07-14 14:44         ` Sagi Grimberg
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-14 14:30 UTC (permalink / raw)




> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi at grimberg.me]
> Sent: Thursday, July 14, 2016 4:18 AM
> To: Ming Lin; linux-nvme at lists.infradead.org
> Cc: Christoph Hellwig; Steve Wise; Jens Axboe
> Subject: Re: [PATCH 2/2] nvme-rdma: move admin queue cleanup to
> nvme_rdma_free_ctrl
> 
> 
> > This patch introduces asymmetry between create and destroy
> > of the admin queue. Does this alternative patch solve
> > the problem?
> >
> > The patch changes the order of device removal flow from:
> > 1. delete controller
> > 2. destroy queue
> >
> > to:
> > 1. destroy queue
> > 2. delete controller
> >
> > Or more specifically:
> > 1. own the controller deletion (make sure we are not
> > competing with anyone)
> > 2. get rid of inflight reconnects (which also destroy and
> > create queues)
> > 3. destroy the queue
> > 4. safely queue controller deletion
> >
> > Thoughts?
> 
> Should mention that patch 1 is not needed as well with this...

Hey Sagi, This patch won't apply.  I think it is munged in the email.  Can you
please email me the patch as an attachment? 

Thanks,

Steve.


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14 14:30       ` Steve Wise
@ 2016-07-14 14:44         ` Sagi Grimberg
  0 siblings, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2016-07-14 14:44 UTC (permalink / raw)



> Hey Sagi, This patch won't apply.  I think it is munged in the email.  Can you
> please email me the patch as an attachment?

Attached
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-nvme-rdma-Fix-device-removal-handling.patch
Type: text/x-patch
Size: 3828 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160714/1ac4aaaa/attachment.bin>


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14  9:15   ` Sagi Grimberg
  2016-07-14  9:17     ` Sagi Grimberg
@ 2016-07-14 14:59     ` Steve Wise
       [not found]     ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
  2 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-14 14:59 UTC (permalink / raw)


> This patch introduces asymmetry between create and destroy
> of the admin queue. Does this alternative patch solve
> the problem?
> 
> The patch changes the order of device removal flow from:
> 1. delete controller
> 2. destroy queue
> 
> to:
> 1. destroy queue
> 2. delete controller
> 
> Or more specifically:
> 1. own the controller deletion (make sure we are not
> competing with anyone)
> 2. get rid of inflight reconnects (which also destroy and
> create queues)
> 3. destroy the queue
> 4. safely queue controller deletion
> 
> Thoughts?
> 

Your patch causes a deadlock during device removal.  

The controller delete work thread is stuck in c4iw_destroy_qp waiting on all
references to go away.  Either nvmf/rdma or the rdma-cm or both.

(gdb) list *c4iw_destroy_qp+0x155
0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596).
1591                    c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE,
&attrs, 0);
1592            wait_event(qhp->wait, !qhp->ep);
1593
1594            remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
1595            atomic_dec(&qhp->refcnt);
1596            wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
1597
1598            spin_lock_irq(&rhp->lock);
1599            if (!list_empty(&qhp->db_fc_entry))
1600                    list_del_init(&qhp->db_fc_entry);

The rdma-cm work thread is stuck trying to grab the cm_id mutex:

(gdb) list *cma_disable_callback+0x2e
0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715).
710
711     static int cma_disable_callback(struct rdma_id_private *id_priv,
712                                     enum rdma_cm_state state)
713     {
714             mutex_lock(&id_priv->handler_mutex);
715             if (id_priv->state != state) {
716                     mutex_unlock(&id_priv->handler_mutex);
717                     return -EINVAL;
718             }
719             return 0;

And the nvmf cm event handler is stuck waiting for the controller delete to
finish:

(gdb) list *nvme_rdma_device_unplug+0x97
0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358).
warning: Source file is more recent than executable.
1353    queue_delete:
1354            /* queue controller deletion */
1355            queue_work(nvme_rdma_wq, &ctrl->delete_work);
1356            flush_work(&ctrl->delete_work);
1357            return ret;
1358    }
1359
1360    static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
1361                    struct rdma_cm_event *ev)
1362    {

Seems like the rdma-cm work thread is trying to grab the cm_id lock for the
cm_id that is handling the DEVICE_REMOVAL event.


Here is the detailed log:

[ 1484.568354] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery",
addr 10.0.1.14:4420
[ 1495.939850] nvme nvme1: creating 32 I/O queues.
[ 1496.635114] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420
[ 1502.813698] nvme nvme1: Got rdma device removal event, deleting ctrl
[ 1683.259465] INFO: task kworker/0:0:4 blocked for more than 120 seconds.
[ 1683.266147]       Tainted: G            E   4.7.0-rc2-block-for-next+ #78
[ 1683.272986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 1683.280830] kworker/0:0     D ffff8810281579c8     0     4      2 0x10000000
[ 1683.287997] Workqueue: nvme_rdma_wq nvme_rdma_del_ctrl_work [nvme_rdma]
[ 1683.294658]  ffff8810281579c8 ffffffff81c0d4c0 ffff88102809c6c0
ffff880fff04e898
[ 1683.302252]  ffffffff81d448b4 ffff88102809d840 ffff881028154008
ffff88007853b3f8
[ 1683.309763]  ffff881028157ac8 ffff880f67ea5758 ffff880f67ea5760
ffff881028157a18
[ 1683.317281] Call Trace:
[ 1683.319752]  [<ffffffff816eaed0>] schedule+0x40/0xb0
[ 1683.324757]  [<ffffffffa036bac5>] c4iw_destroy_qp+0x155/0x2b0 [iw_cxgb4]
[ 1683.331481]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[ 1683.336639]  [<ffffffff810f515a>] ? lock_timer_base+0x5a/0x80
[ 1683.342402]  [<ffffffff816ede49>] ? schedule_timeout+0x1c9/0x220
[ 1683.348438]  [<ffffffff810ccd80>] ? wait_woken+0x80/0x80
[ 1683.353804]  [<ffffffffa03a25a9>] ? ib_mr_pool_destroy+0x79/0x90 [ib_core]
[ 1683.360699]  [<ffffffffa039893e>] ib_destroy_qp+0xbe/0x160 [ib_core]
[ 1683.367085]  [<ffffffffa0706011>] rdma_destroy_qp+0x31/0x50 [rdma_cm]
[ 1683.373538]  [<ffffffffa038ed99>] nvme_rdma_destroy_queue_ib+0x29/0x90
[nvme_rdma]
[ 1683.381127]  [<ffffffffa038f04d>] nvme_rdma_stop_and_free_queue+0x2d/0x40
[nvme_rdma]
[ 1683.388967]  [<ffffffffa038f157>] nvme_rdma_shutdown_ctrl+0x87/0xe0
[nvme_rdma]
[ 1683.396290]  [<ffffffffa038f224>] nvme_rdma_del_ctrl_work+0x34/0x50
[nvme_rdma]
[ 1683.403629]  [<ffffffff810a4d03>] process_one_work+0x183/0x4d0
[ 1683.409485]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[ 1683.414984]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[ 1683.420142]  [<ffffffff810c4130>] ? dequeue_entity+0x120/0x250
[ 1683.426009]  [<ffffffff810a59bd>] worker_thread+0x16d/0x530
[ 1683.431597]  [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0
[ 1683.437176]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[ 1683.442674]  [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90
[ 1683.448511]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[ 1683.454803]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[ 1683.459951]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[ 1683.466223]  [<ffffffff810aa5dc>] kthread+0xcc/0xf0
[ 1683.471106]  [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0
[ 1683.476699]  [<ffffffff810ae3b0>] ? smpboot_create_threads+0x80/0x80
[ 1683.483053]  [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40
[ 1683.488470]  [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70
[ 1683.495476] INFO: task kworker/u64:5:716 blocked for more than 120 seconds.
[ 1683.502441]       Tainted: G            E   4.7.0-rc2-block-for-next+ #78
[ 1683.509237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 1683.517096] kworker/u64:5   D ffff88101d1ff858     0   716      2 0x10000000
[ 1683.524223] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
[ 1683.529467]  ffff88101d1ff858 ffff88102819c6c0 ffff88101d200040
ffff880fff04e898
[ 1683.536954]  ffff88101d200040 ffff88101d200040 ffff88101d1fc008
ffff880f67a0720c
[ 1683.544445]  ffff88101d200040 ffff880f67a07210 ffff88101d1ff918
ffff88101d1ff8a8
[ 1683.551932] Call Trace:
[ 1683.554390]  [<ffffffff816eaed0>] schedule+0x40/0xb0
[ 1683.559358]  [<ffffffff811119ae>] ? is_module_text_address+0xe/0x20
[ 1683.565630]  [<ffffffff816eb12e>] schedule_preempt_disabled+0xe/0x10
[ 1683.571990]  [<ffffffff816ecbe3>] __mutex_lock_slowpath+0x93/0x120
[ 1683.578200]  [<ffffffff813789bb>] ? find_next_bit+0xb/0x10
[ 1683.583694]  [<ffffffff81362816>] ? cpumask_next_and+0x26/0x50
[ 1683.589529]  [<ffffffff810c4baa>] ? find_busiest_group+0x14a/0xa00
[ 1683.595713]  [<ffffffff816ecc93>] mutex_lock+0x23/0x40
[ 1683.600871]  [<ffffffffa070326e>] cma_disable_callback+0x2e/0x60 [rdma_cm]
[ 1683.607760]  [<ffffffffa0707b5b>] cma_iw_handler+0x2b/0x1b0 [rdma_cm]
[ 1683.614207]  [<ffffffffa06f61b3>] cm_close_handler+0x93/0xc0 [iw_cm]
[ 1683.620561]  [<ffffffffa06f80e7>] process_event+0xd7/0xf0 [iw_cm]
[ 1683.626665]  [<ffffffffa06f8247>] cm_work_handler+0x147/0x1c0 [iw_cm]
[ 1683.633121]  [<ffffffff810a54f6>] ?
trace_event_raw_event_workqueue_execute_start+0x66/0xa0
[ 1683.641487]  [<ffffffff810a4d03>] process_one_work+0x183/0x4d0
[ 1683.647334]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[ 1683.652831]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[ 1683.657977]  [<ffffffff816eae91>] ? schedule+0x1/0xb0
[ 1683.663038]  [<ffffffff810a59bd>] worker_thread+0x16d/0x530
[ 1683.668612]  [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0
[ 1683.674197]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[ 1683.679690]  [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90
[ 1683.685534]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[ 1683.691821]  [<ffffffff816eaed0>] ? schedule+0x40/0xb0
[ 1683.696968]  [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120
[ 1683.703230]  [<ffffffff810aa5dc>] kthread+0xcc/0xf0
[ 1683.708126]  [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0
[ 1683.713708]  [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40
[ 1683.719118]  [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70
[ 1683.726126] INFO: task rmmod:6006 blocked for more than 120 seconds.
[ 1683.732483]       Tainted: G            E   4.7.0-rc2-block-for-next+ #78
[ 1683.739278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 1683.747111] rmmod           D ffff880f647479f8     0  6006   4227 0x10000080
[ 1683.754210]  ffff880f647479f8 ffff88102819af40 ffff880f647017c0
ffff880fff04e898
[ 1683.761729]  ffffffff8116c656 0000000000000001 ffff880f64744008
ffff880f64747b98
[ 1683.769244]  7fffffffffffffff ffff880f647017c0 7fffffffffffffff
ffff880f64747a48
[ 1683.776768] Call Trace:
[ 1683.779229]  [<ffffffff8116c656>] ? trace_event_buffer_commit+0x146/0x1d0
[ 1683.786037]  [<ffffffff816eaed0>] schedule+0x40/0xb0
[ 1683.791009]  [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0
[ 1683.796585]  [<ffffffff816ede49>] schedule_timeout+0x1c9/0x220
[ 1683.802430]  [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0
[ 1683.807922]  [<ffffffff810a3603>] ? insert_work+0x53/0xb0
[ 1683.813341]  [<ffffffff816ebede>] wait_for_completion+0xde/0x110
[ 1683.819386]  [<ffffffff810b6b00>] ? try_to_wake_up+0x230/0x230
[ 1683.825236]  [<ffffffff810a472b>] flush_work+0x2b/0x40
[ 1683.830379]  [<ffffffff810a1810>] ? worker_detach_from_pool+0xa0/0xa0
[ 1683.836830]  [<ffffffffa038eff7>] nvme_rdma_device_unplug+0x97/0xc0
[nvme_rdma]
[ 1683.844142]  [<ffffffffa0390782>] nvme_rdma_cm_handler+0x72/0x2f0 [nvme_rdma]
[ 1683.851297]  [<ffffffff812085c4>] ? kmem_cache_free+0x1f4/0x210
[ 1683.857230]  [<ffffffffa0703669>] ? cma_comp+0x49/0x60 [rdma_cm]
[ 1683.863247]  [<ffffffffa070810f>] cma_remove_id_dev+0x8f/0xa0 [rdma_cm]
[ 1683.869865]  [<ffffffffa07081d7>] cma_process_remove+0xb7/0x100 [rdma_cm]
[ 1683.876681]  [<ffffffff812b2a14>] ? __kernfs_remove+0x114/0x1d0
[ 1683.882608]  [<ffffffffa070825e>] cma_remove_one+0x3e/0x60 [rdma_cm]
[ 1683.888983]  [<ffffffffa039dca0>] ib_unregister_device+0xb0/0x150 [ib_core]
[ 1683.895957]  [<ffffffffa0363034>] c4iw_unregister_device+0x64/0x90 [iw_cxgb4]
[ 1683.903109]  [<ffffffffa0356357>] c4iw_remove+0x27/0x60 [iw_cxgb4]
[ 1683.909303]  [<ffffffffa036e6fd>] c4iw_exit_module+0x31/0x934 [iw_cxgb4]
[ 1683.916018]  [<ffffffff811120e3>] SyS_delete_module+0x183/0x1d0
[ 1683.921955]  [<ffffffff81003476>] ? do_audit_syscall_entry+0x66/0x70
[ 1683.928314]  [<ffffffff8100378c>] ? trace_event_raw_event_sys_exit+0x6c/0xa0
[ 1683.935357]  [<ffffffff81003b45>] ? syscall_trace_enter+0x65/0x70
[ 1683.941501]  [<ffffffff81003f08>] do_syscall_64+0x78/0x1d0
[ 1683.947003]  [<ffffffff8106e367>] ? do_page_fault+0x37/0x90
[ 1683.952587]  [<ffffffff816eebe1>] entry_SYSCALL64_slow_path+0x25/0x25


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
       [not found]     ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
@ 2016-07-14 15:02       ` Steve Wise
  2016-07-14 15:26         ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-14 15:02 UTC (permalink / raw)


> > This patch introduces asymmetry between create and destroy
> > of the admin queue. Does this alternative patch solve
> > the problem?
> >
> > The patch changes the order of device removal flow from:
> > 1. delete controller
> > 2. destroy queue
> >
> > to:
> > 1. destroy queue
> > 2. delete controller
> >
> > Or more specifically:
> > 1. own the controller deletion (make sure we are not
> > competing with anyone)
> > 2. get rid of inflight reconnects (which also destroy and
> > create queues)
> > 3. destroy the queue
> > 4. safely queue controller deletion
> >
> > Thoughts?
> >
> 
> Your patch causes a deadlock during device removal.
> 
> The controller delete work thread is stuck in c4iw_destroy_qp waiting on
> all references to go away.  Either nvmf/rdma or the rdma-cm or both.
> 
> (gdb) list *c4iw_destroy_qp+0x155
> 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596).
> 1591                    c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE,
> &attrs, 0);
> 1592            wait_event(qhp->wait, !qhp->ep);
> 1593
> 1594            remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
> 1595            atomic_dec(&qhp->refcnt);
> 1596            wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
> 1597
> 1598            spin_lock_irq(&rhp->lock);
> 1599            if (!list_empty(&qhp->db_fc_entry))
> 1600                    list_del_init(&qhp->db_fc_entry);
> 
> The rdma-cm work thread is stuck trying to grab the cm_id mutex:
> 
> (gdb) list *cma_disable_callback+0x2e
> 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715).
> 710
> 711     static int cma_disable_callback(struct rdma_id_private *id_priv,
> 712                                     enum rdma_cm_state state)
> 713     {
> 714             mutex_lock(&id_priv->handler_mutex);
> 715             if (id_priv->state != state) {
> 716                     mutex_unlock(&id_priv->handler_mutex);
> 717                     return -EINVAL;
> 718             }
> 719             return 0;
> 
> And the nvmf cm event handler is stuck waiting for the controller delete
> to finish:
> 
> (gdb) list *nvme_rdma_device_unplug+0x97
> 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358).
> warning: Source file is more recent than executable.
> 1353    queue_delete:
> 1354            /* queue controller deletion */
> 1355            queue_work(nvme_rdma_wq, &ctrl->delete_work);
> 1356            flush_work(&ctrl->delete_work);
> 1357            return ret;
> 1358    }
> 1359
> 1360    static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> 1361                    struct rdma_cm_event *ev)
> 1362    {
> 
> Seems like the rdma-cm work thread is trying to grab the cm_id lock for
> the cm_id that is handling the DEVICE_REMOVAL event.
> 

And, the nvmf/rdma delete controller work thread is trying to delete the cm_id
that received the DEVICE_REMOVAL event, which is the crux o' the biscuit,
methinks...


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14 15:02       ` Steve Wise
@ 2016-07-14 15:26         ` Steve Wise
  2016-07-14 21:27           ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-14 15:26 UTC (permalink / raw)


> > > This patch introduces asymmetry between create and destroy
> > > of the admin queue. Does this alternative patch solve
> > > the problem?
> > >
> > > The patch changes the order of device removal flow from:
> > > 1. delete controller
> > > 2. destroy queue
> > >
> > > to:
> > > 1. destroy queue
> > > 2. delete controller
> > >
> > > Or more specifically:
> > > 1. own the controller deletion (make sure we are not
> > > competing with anyone)
> > > 2. get rid of inflight reconnects (which also destroy and
> > > create queues)
> > > 3. destroy the queue
> > > 4. safely queue controller deletion
> > >
> > > Thoughts?
> > >
> >
> > Your patch causes a deadlock during device removal.
> >
> > The controller delete work thread is stuck in c4iw_destroy_qp waiting on
> > all references to go away.  Either nvmf/rdma or the rdma-cm or both.
> >
> > (gdb) list *c4iw_destroy_qp+0x155
> > 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596).
> > 1591                    c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE,
> > &attrs, 0);
> > 1592            wait_event(qhp->wait, !qhp->ep);
> > 1593
> > 1594            remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
> > 1595            atomic_dec(&qhp->refcnt);
> > 1596            wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
> > 1597
> > 1598            spin_lock_irq(&rhp->lock);
> > 1599            if (!list_empty(&qhp->db_fc_entry))
> > 1600                    list_del_init(&qhp->db_fc_entry);
> >
> > The rdma-cm work thread is stuck trying to grab the cm_id mutex:
> >
> > (gdb) list *cma_disable_callback+0x2e
> > 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715).
> > 710
> > 711     static int cma_disable_callback(struct rdma_id_private *id_priv,
> > 712                                     enum rdma_cm_state state)
> > 713     {
> > 714             mutex_lock(&id_priv->handler_mutex);
> > 715             if (id_priv->state != state) {
> > 716                     mutex_unlock(&id_priv->handler_mutex);
> > 717                     return -EINVAL;
> > 718             }
> > 719             return 0;
> >
> > And the nvmf cm event handler is stuck waiting for the controller delete
> > to finish:
> >
> > (gdb) list *nvme_rdma_device_unplug+0x97
> > 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358).
> > warning: Source file is more recent than executable.
> > 1353    queue_delete:
> > 1354            /* queue controller deletion */
> > 1355            queue_work(nvme_rdma_wq, &ctrl->delete_work);
> > 1356            flush_work(&ctrl->delete_work);
> > 1357            return ret;
> > 1358    }
> > 1359
> > 1360    static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> > 1361                    struct rdma_cm_event *ev)
> > 1362    {
> >
> > Seems like the rdma-cm work thread is trying to grab the cm_id lock for
> > the cm_id that is handling the DEVICE_REMOVAL event.
> >
> 
> And, the nvmf/rdma delete controller work thread is trying to delete the cm_id
> that received the DEVICE_REMOVAL event, which is the crux o' the biscuit,
> methinks...
> 

Correction: the del controller work thread is trying to destroy the qp
associated with the cm_id.  But the point is this cm_id/qp should NOT be touched
by the del controller thread because the unplug thread should have cleared the
Q_CONNECTED bit and thus took ownership of destroying it.  I'll add some debug
prints to see which path is being taken by nvme_rdma_device_unplug().
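
The ownership handoff described here relies on test_and_clear_bit()
being atomic: whichever path clears the CONNECTED bit first wins and is
the only one allowed to tear the queue down. A tiny sketch of the
pattern (illustrative names, not the driver code):

#include <linux/bitops.h>
#include <linux/types.h>

enum { EXAMPLE_Q_CONNECTED = 0 };

/* Returns true for exactly one caller; everyone else must leave the
 * queue alone because the winner now owns its teardown. */
static bool example_claim_queue_teardown(unsigned long *flags)
{
	return test_and_clear_bit(EXAMPLE_Q_CONNECTED, flags);
}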


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14  6:39         ` Ming Lin
@ 2016-07-14 17:09           ` J Freyensee
  2016-07-14 18:04             ` Ming Lin
  0 siblings, 1 reply; 26+ messages in thread
From: J Freyensee @ 2016-07-14 17:09 UTC (permalink / raw)


On Wed, 2016-07-13 at 23:39 -0700, Ming Lin wrote:
> On Wed, 2016-07-13 at 16:59 -0700, J Freyensee wrote:
> > On Wed, 2016-07-13 at 16:36 -0700, Ming Lin wrote:
> > > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee
> > > <james_p_freyensee@linux.intel.com> wrote:
> > > > 
> > > > >  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> > > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct
> > > > > nvme_ctrl
> > > > > *nctrl)
> > > > >       list_del(&ctrl->list);
> > > > >       mutex_unlock(&nvme_rdma_ctrl_mutex);
> > > > > 
> > > > > +     blk_cleanup_queue(ctrl->ctrl.admin_q);
> > > > > +     blk_mq_free_tag_set(&ctrl->admin_tag_set);
> > > > > +     nvme_rdma_dev_put(ctrl->device);
> > > > > +
> > > > >       if (ctrl->ctrl.tagset) {
> > > > >               blk_cleanup_queue(ctrl->ctrl.connect_q);
> > > > >               blk_mq_free_tag_set(&ctrl->tag_set);
> > > > 
> > > > This patch does not remove the second
> > > > 
> > > > nvme_rdma_dev_put(ctrl->device);
> > > > 
> > > > call that happens within the if() statement above if it
> > > > evaluates
> > > > to
> > > > TRUE.  Should that have been removed or moved elsewhere?
> > > 
> > > Not sure I understand your question.
> > > Did you mean line 694?
> > 
> > Yes I mean line 694.
> > 
> > > 
> > > For discovery controller, there is no IO queues. So ctrl-
> > > > ctrl.tagset
> > > is NULL.
> > > 
> > > The first bulk of
> > > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for
> > > admin
> > > queue.
> > > And the second is for IO queues.
> > 
> > I'm just confused when nvme_free_ctrl() in core.c calls:
> > 
> > ctrl->ops->free_ctrl(ctrl);
> > 
> > which looks like would be the only call that would free both the
> > admin
> > and I/O rdma queues, why there would be the potential to do a
> > _put()
> > twice in nvme_rdma_free_ctrl() via:
> > 
> > nvme_rdma_dev_put(ctrl->device);
> > 
> > one for the admin section:
> > 
> > 687         blk_cleanup_queue(ctrl>ctrl.admin_q);
> > 688         blk_mq_free_tag_set(&ctrl->admin_tag_set);
> > 689         nvme_rdma_dev_put(ctrl->device);
> 
> This put paired with the get in nvme_rdma_configure_admin_queue()
> 
> > 
> > and one for the I/O section (assuming "if (ctrl->ctrl.tagset)"
> > evaluate
> > s to TRUE in that call):
> > 
> > 691         if (ctrl->ctrl.tagset) {
> > 692                 blk_cleanup_queue(ctrl->ctrl.connect_q);
> > 693                 blk_mq_free_tag_set(&ctrl->tag_set);
> > 694                 nvme_rdma_dev_put(ctrl->device);
> 
> This put paired with the get in nvme_rdma_create_io_queues()
> 
> > 695         }
> > 
> > 
> > My assumption would be that the correct path for this case would be
> > such that 
> > 
> > nvme_rdma_dev_put(ctrl->device); 
> > 
> > would only be called one time, for a single device.
> 
> As above, need put for each get.

OK, thanks for the clarification on this.

Would it then be worth to have an if() check around the new admin
logic, as the assumption with the added code is it will always need to
free the admin_queue when called, something like:

if (ctrl->admin_tag_set) {
	blk_cleanup_queue(ctrl->ctrl.admin_q);
	blk_mq_free_tag_set(&ctrl->admin_tag_set);
	nvme_rdma_dev_put(ctrl->device); 
}

blk_mq_free_tag_set() does assume its parameter is not NULL.



> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14 17:09           ` J Freyensee
@ 2016-07-14 18:04             ` Ming Lin
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lin @ 2016-07-14 18:04 UTC (permalink / raw)


On Thu, Jul 14, 2016 at 10:09 AM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On Wed, 2016-07-13 at 23:39 -0700, Ming Lin wrote:
>> On Wed, 2016-07-13 at 16:59 -0700, J Freyensee wrote:
>> > On Wed, 2016-07-13 at 16:36 -0700, Ming Lin wrote:
>> > > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee
>> > > <james_p_freyensee@linux.intel.com> wrote:
>> > > >
>> > > > >  static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
>> > > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct
>> > > > > nvme_ctrl
>> > > > > *nctrl)
>> > > > >       list_del(&ctrl->list);
>> > > > >       mutex_unlock(&nvme_rdma_ctrl_mutex);
>> > > > >
>> > > > > +     blk_cleanup_queue(ctrl->ctrl.admin_q);
>> > > > > +     blk_mq_free_tag_set(&ctrl->admin_tag_set);
>> > > > > +     nvme_rdma_dev_put(ctrl->device);
>> > > > > +
>> > > > >       if (ctrl->ctrl.tagset) {
>> > > > >               blk_cleanup_queue(ctrl->ctrl.connect_q);
>> > > > >               blk_mq_free_tag_set(&ctrl->tag_set);
>> > > >
>> > > > This patch does not remove the second
>> > > >
>> > > > nvme_rdma_dev_put(ctrl->device);
>> > > >
>> > > > call that happens within the if() statement above if it
>> > > > evaluates
>> > > > to
>> > > > TRUE.  Should that have been removed or moved elsewhere?
>> > >
>> > > Not sure I understand your question.
>> > > Did you mean line 694?
>> >
>> > Yes I mean line 694.
>> >
>> > >
>> > > For discovery controller, there is no IO queues. So ctrl-
>> > > > ctrl.tagset
>> > > is NULL.
>> > >
>> > > The first bulk of
>> > > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for
>> > > admin
>> > > queue.
>> > > And the second is for IO queues.
>> >
>> > I'm just confused when nvme_free_ctrl() in core.c calls:
>> >
>> > ctrl->ops->free_ctrl(ctrl);
>> >
>> > which looks like would be the only call that would free both the
>> > admin
>> > and I/O rdma queues, why there would be the potential to do a
>> > _put()
>> > twice in nvme_rdma_free_ctrl() via:
>> >
>> > nvme_rdma_dev_put(ctrl->device);
>> >
>> > one for the admin section:
>> >
>> > 687         blk_cleanup_queue(ctrl>ctrl.admin_q);
>> > 688         blk_mq_free_tag_set(&ctrl->admin_tag_set);
>> > 689         nvme_rdma_dev_put(ctrl->device);
>>
>> This put paired with the get in nvme_rdma_configure_admin_queue()
>>
>> >
>> > and one for the I/O section (assuming "if (ctrl->ctrl.tagset)"
>> > evaluate
>> > s to TRUE in that call):
>> >
>> > 691         if (ctrl->ctrl.tagset) {
>> > 692                 blk_cleanup_queue(ctrl->ctrl.connect_q);
>> > 693                 blk_mq_free_tag_set(&ctrl->tag_set);
>> > 694                 nvme_rdma_dev_put(ctrl->device);
>>
>> This put paired with the get in nvme_rdma_create_io_queues()
>>
>> > 695         }
>> >
>> >
>> > My assumption would be that the correct path for this case would be
>> > such that
>> >
>> > nvme_rdma_dev_put(ctrl->device);
>> >
>> > would only be called one time, for a single device.
>>
>> As above, need put for each get.
>
> OK, thanks for the clarification on this.
>
> Would it then be worth to have an if() check around the new admin
> logic, as the assumption with the added code is it will always need to
> free the admin_queue when called, something like:

No, because admin_tag_set is never NULL.

>
> if (ctrl->admin_tag_set) {
>         blk_cleanup_queue(ctrl->ctrl.admin_q);
>         blk_mq_free_tag_set(&ctrl->admin_tag_set);
>         nvme_rdma_dev_put(ctrl->device);
> }
>
> blk_mq_free_tag_set() does assume its parameter is not NULL.

Sagi's patch is better than these two.
We'll go for it.


* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14 15:26         ` Steve Wise
@ 2016-07-14 21:27           ` Steve Wise
  2016-07-15 15:52             ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-14 21:27 UTC (permalink / raw)


> > > > This patch introduces asymmetry between create and destroy
> > > > of the admin queue. Does this alternative patch solve
> > > > the problem?
> > > >
> > > > The patch changes the order of device removal flow from:
> > > > 1. delete controller
> > > > 2. destroy queue
> > > >
> > > > to:
> > > > 1. destroy queue
> > > > 2. delete controller
> > > >
> > > > Or more specifically:
> > > > 1. own the controller deletion (make sure we are not
> > > > competing with anyone)
> > > > 2. get rid of inflight reconnects (which also destroy and
> > > > create queues)
> > > > 3. destroy the queue
> > > > 4. safely queue controller deletion
> > > >
> > > > Thoughts?
> > > >
> > >
> > > Your patch causes a deadlock during device removal.
> > >
> > > The controller delete work thread is stuck in c4iw_destroy_qp waiting for
> > > all references to go away; either nvmf/rdma or the rdma-cm (or both) still
> > > holds one.
> > >
> > > (gdb) list *c4iw_destroy_qp+0x155
> > > 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596).
> > > 1591                    c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE,
> > > &attrs, 0);
> > > 1592            wait_event(qhp->wait, !qhp->ep);
> > > 1593
> > > 1594            remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
> > > 1595            atomic_dec(&qhp->refcnt);
> > > 1596            wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
> > > 1597
> > > 1598            spin_lock_irq(&rhp->lock);
> > > 1599            if (!list_empty(&qhp->db_fc_entry))
> > > 1600                    list_del_init(&qhp->db_fc_entry);
> > >
> > > The rdma-cm work thread is stuck trying to grab the cm_id mutex:
> > >
> > > (gdb) list *cma_disable_callback+0x2e
> > > 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715).
> > > 710
> > > 711     static int cma_disable_callback(struct rdma_id_private *id_priv,
> > > 712                                     enum rdma_cm_state state)
> > > 713     {
> > > 714             mutex_lock(&id_priv->handler_mutex);
> > > 715             if (id_priv->state != state) {
> > > 716                     mutex_unlock(&id_priv->handler_mutex);
> > > 717                     return -EINVAL;
> > > 718             }
> > > 719             return 0;
> > >
> > > And the nvmf cm event handler is stuck waiting for the controller delete
> > > to finish:
> > >
> > > (gdb) list *nvme_rdma_device_unplug+0x97
> > > 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358).
> > > warning: Source file is more recent than executable.
> > > 1353    queue_delete:
> > > 1354            /* queue controller deletion */
> > > 1355            queue_work(nvme_rdma_wq, &ctrl->delete_work);
> > > 1356            flush_work(&ctrl->delete_work);
> > > 1357            return ret;
> > > 1358    }
> > > 1359
> > > 1360    static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> > > 1361                    struct rdma_cm_event *ev)
> > > 1362    {
> > >
> > > Seems like the rdma-cm work thread is trying to grab the cm_id lock for
> > > the cm_id that is handling the DEVICE_REMOVAL event.
> > >
> >
> > And, the nvmf/rdma delete controller work thread is trying to delete the
> > cm_id that received the DEVICE_REMOVAL event, which is the crux o' the
> > biscuit, methinks...
> >
> 
> Correction: the del controller work thread is trying to destroy the qp
> associated with the cm_id.  But the point is that this cm_id/qp should NOT be
> touched by the del controller thread, because the unplug thread should have
> cleared the Q_CONNECTED bit and thus taken ownership of destroying it.  I'll
> add some debug prints to see which path is being taken by
> nvme_rdma_device_unplug().
> 

After further debug, the del controller work thread is not trying to destroy the
qp/cm_id that received the event.  That qp/cm_id is successfully deleted by the
unplug thread.  However the first cm_id/qp that is destroyed by the del
controller work thread gets stuck in c4iw_destroy_qp() due to the deadlock.   So
I need to understand more about the deadlock...

Steve.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-14 21:27           ` Steve Wise
@ 2016-07-15 15:52             ` Steve Wise
  2016-07-17  6:01               ` Sagi Grimberg
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-15 15:52 UTC (permalink / raw)


> > Correction: the del controller work thread is trying to destroy the qp
> > associated with the cm_id.  But the point is that this cm_id/qp should NOT be
> > touched by the del controller thread, because the unplug thread should have
> > cleared the Q_CONNECTED bit and thus taken ownership of destroying it.  I'll
> > add some debug prints to see which path is being taken by
> > nvme_rdma_device_unplug().
> >
> 
> After further debug, the del controller work thread is not trying to destroy
> the qp/cm_id that received the event.  That qp/cm_id is successfully deleted
> by the unplug thread.  However the first cm_id/qp that is destroyed by the del
> controller work thread gets stuck in c4iw_destroy_qp() due to the deadlock.
> So I need to understand more about the deadlock...

Hey Sagi, here is some lite reading for you. :)

Prelude:  As part of disconnecting an iwarp connection, the iwarp provider needs
to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto the
single-threaded iw_cm workqueue.

Here is what happens with Sagi's patch:

nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event handler
for this cm_id, function cm_iw_handler(), which blocks until any currently
running event handler for this cm_id finishes.  It does this by calling
cm_disable_callback().  However, since this whole unplug process is running in
the event handler function for this same cm_id, the iw_cm workq thread is now
stuck in a deadlock.  nvme_rdma_device_unplug(), however, continues on and
schedules the controller delete worker thread and waits for it to complete.  The
delete controller worker thread tries to disconnect and destroy all the
remaining IO queues, but gets stuck in the destroy() path on the first IO queue
because the iw_cm workq thread is already stuck, and processing the CLOSE event
is required to release a reference the iw_cm has on the iwarp provider's qp.  So
everything comes to a grinding halt....

Now: Ming's 2 patches avoid this deadlock because the cm_id that received the
device removal event is disconnected/destroyed _only after_ all the controller
queues are disconnected/destroyed.  So nvme_rdma_device_unplug() doesn't get
stuck waiting for the controller to delete the io queues; only after that
completes does it delete the cm_id/qp that got the device removal event.  It
then returns, thus causing the rdma_cm to release the cm_id's callback mutex.
This causes the iw_cm workq thread to now unblock and we continue on.  (can you
say house of cards?)

So the net is:  the cm_id that received the device remove event _must_ be
disconnected/destroyed _last_.

Steve.
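
For readers trying to follow the ordering being argued for, here is a rough
sketch of the "destroy the event's own queue last" flow described above.  The
names are taken from the traces earlier in the thread, but this is a simplified
illustration, not the literal patch:

static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
{
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
	int ret = 1;	/* non-zero: let the rdma_cm destroy this cm_id itself */

	/*
	 * 1. Own controller deletion and tear down every *other* queue
	 *    first.  The iw_cm workqueue can still make progress because
	 *    the cm_id that got DEVICE_REMOVAL has not been touched yet.
	 */
	queue_work(nvme_rdma_wq, &ctrl->delete_work);
	flush_work(&ctrl->delete_work);

	/* 2. Only now free the queue whose cm_id received DEVICE_REMOVAL. */
	nvme_rdma_stop_queue(queue);
	nvme_rdma_destroy_queue_ib(queue);

	return ret;
}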

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-15 15:52             ` Steve Wise
@ 2016-07-17  6:01               ` Sagi Grimberg
  2016-07-18 14:55                 ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2016-07-17  6:01 UTC (permalink / raw)



> Hey Sagi, here is some lite reading for you. :)
>
> Prelude:  As part of disconnecting an iwarp connection, the iwarp provider needs
> to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto the
> singlethread workq thread for iw_cm.
>
> Here is what happens with Sagi's patch:
>
> nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
> IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
> iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event handler
> for this cm_id, function cm_iw_handler(), which blocks until any currently
> running event handler for this cm_id finishes.  It does this by calling
> cm_disable_callback().  However since this whole unplug process is running in
> the event handler function for this same cm_id, the iw_cm workq thread is now
> stuck in a deadlock.   nvme_rdma_device_unplug() however, continues on and
> schedules the controller delete worker thread and waits for it to complete.  The
> delete controller worker thread tries to disconnect and destroy all the
> remaining IO queues, but gets stuck in the destroy() path on the first IO queue
> because the iw_cm workq thread is already stuck, and processing the CLOSE event
> is required to release a reference the iw_cm has on the iwarp provider's qp.  So
> everything comes to a grinding halt....
>
> Now: Ming's 2 patches avoid this deadlock because the cm_id that received the
> device removal event is disconnected/destroyed _only after_ all the controller
> queues are disconnected/destroyed.  So nvme_rdma_device_unplug() doesn't get
> stuck waiting for the controller to delete the io queues, and only after that
> completes, does it delete the cm_id/qp that got the device removal event.  It
> then returns thus causing the rdma_cm to release the cm_id's callback mutex.
> This causes the iw_cm workq thread to now unblock and we continue on.  (can you
> say house of cards?)
>
> So the net is:  the cm_id that received the device remove event _must_ be
> disconnected/destroyed _last_.

Hey Steve, thanks for the detailed description.

IMHO, this really looks like a buggy design in the iWARP connection
management implementation. The fact that rdma_disconnect's forward
progress depends on another cm_id's execution looks wrong and
backwards to me.

Given that the device is being removed altogether, I don't think that
calling rdma_disconnect is important at all. Since ib_drain_qp makes
sure that the queue pair is in the error state, does this incremental
patch resolve the iWARP deadlock you are seeing?

--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index fd90b5c00aae..e7eec7d8c705 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1342,8 +1342,12 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
                 goto queue_delete;
         }

+       /*
+        * iwcm does not handle rdma_disconnect within DEVICE_REMOVAL
+        * event very well, so we settle with qp drain
+        */
+       ib_drain_qp(queue->qp);
         /* Free this queue ourselves */
-       nvme_rdma_stop_queue(queue);
         nvme_rdma_destroy_queue_ib(queue);

         /* Return non-zero so the cm_id will destroy implicitly */
--
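
Assembled for context from the hunk above and the unplug path shown in the
trace earlier in the thread, the resulting flow would look roughly like this.
This is a sketch, not a verified diff, and the ownership test is paraphrased
from the Q_CONNECTED discussion:

static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
{
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
	int ret = 0;

	/* Own the queue; if a reconnect already owns it, just queue deletion. */
	if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
		goto queue_delete;

	/*
	 * iwcm does not handle rdma_disconnect within DEVICE_REMOVAL
	 * event very well, so we settle with qp drain
	 */
	ib_drain_qp(queue->qp);

	/* Free this queue ourselves */
	nvme_rdma_destroy_queue_ib(queue);

	/* Return non-zero so the cm_id will destroy implicitly */
	ret = 1;

queue_delete:
	/* queue controller deletion */
	queue_work(nvme_rdma_wq, &ctrl->delete_work);
	flush_work(&ctrl->delete_work);
	return ret;
}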

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-17  6:01               ` Sagi Grimberg
@ 2016-07-18 14:55                 ` Steve Wise
  2016-07-18 15:47                   ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 14:55 UTC (permalink / raw)


> > Hey Sagi, here is some lite reading for you. :)
> >
> > Prelude:  As part of disconnecting an iwarp connection, the iwarp provider
> > needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto
> > the single-threaded iw_cm workqueue.
> >
> > Here is what happens with Sagi's patch:
> >
> > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> > rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
> > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
> > iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event
> > handler for this cm_id, function cm_iw_handler(), which blocks until any
> > currently running event handler for this cm_id finishes.  It does this by
> > calling cm_disable_callback().  However, since this whole unplug process is
> > running in the event handler function for this same cm_id, the iw_cm workq
> > thread is now stuck in a deadlock.  nvme_rdma_device_unplug(), however,
> > continues on and schedules the controller delete worker thread and waits
> > for it to complete.  The delete controller worker thread tries to disconnect
> > and destroy all the remaining IO queues, but gets stuck in the destroy()
> > path on the first IO queue because the iw_cm workq thread is already stuck,
> > and processing the CLOSE event is required to release a reference the iw_cm
> > has on the iwarp provider's qp.  So everything comes to a grinding halt....
> >
> > Now: Ming's 2 patches avoid this deadlock because the cm_id that received
> > the device removal event is disconnected/destroyed _only after_ all the
> > controller queues are disconnected/destroyed.  So nvme_rdma_device_unplug()
> > doesn't get stuck waiting for the controller to delete the io queues; only
> > after that completes does it delete the cm_id/qp that got the device removal
> > event.  It then returns, thus causing the rdma_cm to release the cm_id's
> > callback mutex.  This causes the iw_cm workq thread to now unblock and we
> > continue on.  (can you say house of cards?)
> >
> > So the net is:  the cm_id that received the device remove event _must_ be
> > disconnected/destroyed _last_.
> 
> Hey Steve, thanks for the detailed description.
> 
> IMHO, this really looks like a buggy design in iWARP connection
> management implementation. The fact that a rdma_disconnect forward
> progress is dependent on other cm_id execution looks wrong and
> backwards to me.
> 

Yea.  I'm not sure how to fix it at this point...

> Given that the device is being removed altogether I don't think that
> calling rdma_disconnect is important at all. Given the fact that
> ib_drain_qp makes sure that the queue-pair is in error state does
> this incremental patch resolve the iwarp-deadlock you are seeing?
>

I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.  When
I implemented them I assumed the QP would be disconnected.  Let me try this and
report back.
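
A rough sketch of the kind of change being described here, reusing the
attribute names visible in the c4iw_destroy_qp() listing earlier in the thread;
hypothetical and untested:

void c4iw_drain_sq(struct ib_qp *ibqp)
{
	struct c4iw_qp *qhp = to_c4iw_qp(ibqp);
	struct c4iw_qp_attributes attrs = { .next_state = C4IW_QP_STATE_ERROR };

	/*
	 * Move the QP to ERROR ourselves so posted work requests are
	 * flushed even though no disconnect has run yet.
	 */
	c4iw_modify_qp(qhp->rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, &attrs, 0);

	/* ... then wait for the SQ flush to complete, as before ... */
}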
 
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index fd90b5c00aae..e7eec7d8c705 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1342,8 +1342,12 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
>                  goto queue_delete;
>          }
> 
> +       /*
> +        * iwcm does not handle rdma_disconnect within DEVICE_REMOVAL
> +        * event very well, so we settle with qp drain
> +        */
> +       ib_drain_qp(queue->qp);
>          /* Free this queue ourselves */
> -       nvme_rdma_stop_queue(queue);
>          nvme_rdma_destroy_queue_ib(queue);
> 
>          /* Return non-zero so the cm_id will destroy implicitly */
> --

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 14:55                 ` Steve Wise
@ 2016-07-18 15:47                   ` Steve Wise
  2016-07-18 16:34                     ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 15:47 UTC (permalink / raw)


> > > Hey Sagi, here is some lite reading for you. :)
> > >
> > > Prelude:  As part of disconnecting an iwarp connection, the iwarp provider
> > > needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto
> > > the single-threaded iw_cm workqueue.
> > >
> > > Here is what happens with Sagi's patch:
> > >
> > > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> > > rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
> > > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
> > > iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event
> > > handler for this cm_id, function cm_iw_handler(), which blocks until any
> > > currently running event handler for this cm_id finishes.  It does this by
> > > calling cm_disable_callback().  However, since this whole unplug process is
> > > running in the event handler function for this same cm_id, the iw_cm workq
> > > thread is now stuck in a deadlock.  nvme_rdma_device_unplug(), however,
> > > continues on and schedules the controller delete worker thread and waits
> > > for it to complete.  The delete controller worker thread tries to disconnect
> > > and destroy all the remaining IO queues, but gets stuck in the destroy()
> > > path on the first IO queue because the iw_cm workq thread is already stuck,
> > > and processing the CLOSE event is required to release a reference the iw_cm
> > > has on the iwarp provider's qp.  So everything comes to a grinding halt....
> > >
> > > Now: Ming's 2 patches avoid this deadlock because the cm_id that received
> > > the device removal event is disconnected/destroyed _only after_ all the
> > > controller queues are disconnected/destroyed.  So nvme_rdma_device_unplug()
> > > doesn't get stuck waiting for the controller to delete the io queues; only
> > > after that completes does it delete the cm_id/qp that got the device removal
> > > event.  It then returns, thus causing the rdma_cm to release the cm_id's
> > > callback mutex.  This causes the iw_cm workq thread to now unblock and we
> > > continue on.  (can you say house of cards?)
> > >
> > > So the net is:  the cm_id that received the device remove event _must_ be
> > > disconnected/destroyed _last_.
> >
> > Hey Steve, thanks for the detailed description.
> >
> > IMHO, this really looks like a buggy design in iWARP connection
> > management implementation. The fact that a rdma_disconnect forward
> > progress is dependent on other cm_id execution looks wrong and
> > backwards to me.
> >
> 
> Yea.  I'm not sure how to fix it at this point...
> 
> > Given that the device is being removed altogether I don't think that
> > calling rdma_disconnect is important at all. Given the fact that
> > ib_drain_qp makes sure that the queue-pair is in error state does
> > this incremental patch resolve the iwarp-deadlock you are seeing?
> >
> 
> I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
> When I implemented them I assumed the QP would be disconnected.  Let me try
> this and report back.

Just doing the drain has the same issue, because moving the QP to ERROR goes
through the same process as a disconnect wrt the provider/IWCM interactions.

The reason this close provider/iwcm interaction is needed is the iWARP
relationship between QPs and the TCP connection (and thus the cm_id).  The
iwarp driver stack manipulates the QP state based on the connection events
directly. For example, the transition from IDLE -> RTS is done _only_ by the
iwarp driver stack.  Similarly, if the peer closes the TCP connection (or
aborts), then the local iwarp driver stack will move the QP out of RTS into
CLOSING or ERROR.  I think this is different from IB and the IBCM.  This,
combined with the IWCM maintaining a reference on the QP until the iwarp
driver posts a CLOSED event for its iw_cm_id, is what causes the problem here.
I'm not justifying the existing logic, just trying to explain.

Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm to
fix this problem...  I guess the main problem is that qp disconnect and destroy
basically cannot be done in the event handler.  Granted, it works if you only
destroy the qp associated with the cm_id handling the event, but that still
causes a deadlock; it's just that the event handler can return and thus unblock
the deadlock.
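
To make that coupling concrete, here is a purely illustrative toy handler
(every name below is invented; this is not code from any driver): the provider,
not the ULP, drives the QP state from TCP-level events, and the IWCM only drops
its QP reference once the provider posts the final CLOSE event.

struct toy_cm_id;
static void toy_post_iwcm_close_event(struct toy_cm_id *cm_id);

enum toy_qp_state { TOY_IDLE, TOY_RTS, TOY_CLOSING, TOY_ERROR };

struct toy_qp {
	enum toy_qp_state state;
	struct toy_cm_id *cm_id;
};

static void toy_tcp_conn_event(struct toy_qp *qp, bool peer_closed, bool aborted)
{
	/* the provider moves the QP out of RTS based on the TCP event */
	if (aborted)
		qp->state = TOY_ERROR;
	else if (peer_closed)
		qp->state = TOY_CLOSING;

	/*
	 * Posting the CLOSE event to the iw_cm workqueue is what finally
	 * lets the IWCM drop its reference on the QP, so destroying the
	 * QP cannot complete while that workqueue is blocked.
	 */
	toy_post_iwcm_close_event(qp->cm_id);
}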
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 15:47                   ` Steve Wise
@ 2016-07-18 16:34                     ` Steve Wise
  2016-07-18 18:04                       ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 16:34 UTC (permalink / raw)


> > > > Hey Sagi, here is some lite reading for you. :)
> > > >
> > > > Prelude:  As part of disconnecting an iwarp connection, the iwarp provider
> > > > needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto
> > > > the single-threaded iw_cm workqueue.
> > > >
> > > > Here is what happens with Sagi's patch:
> > > >
> > > > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> > > > rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
> > > > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
> > > > iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event
> > > > handler for this cm_id, function cm_iw_handler(), which blocks until any
> > > > currently running event handler for this cm_id finishes.  It does this by
> > > > calling cm_disable_callback().  However, since this whole unplug process is
> > > > running in the event handler function for this same cm_id, the iw_cm workq
> > > > thread is now stuck in a deadlock.  nvme_rdma_device_unplug(), however,
> > > > continues on and schedules the controller delete worker thread and waits
> > > > for it to complete.  The delete controller worker thread tries to disconnect
> > > > and destroy all the remaining IO queues, but gets stuck in the destroy()
> > > > path on the first IO queue because the iw_cm workq thread is already stuck,
> > > > and processing the CLOSE event is required to release a reference the iw_cm
> > > > has on the iwarp provider's qp.  So everything comes to a grinding halt....
> > > >
> > > > Now: Ming's 2 patches avoid this deadlock because the cm_id that received
> > > > the device removal event is disconnected/destroyed _only after_ all the
> > > > controller queues are disconnected/destroyed.  So nvme_rdma_device_unplug()
> > > > doesn't get stuck waiting for the controller to delete the io queues; only
> > > > after that completes does it delete the cm_id/qp that got the device removal
> > > > event.  It then returns, thus causing the rdma_cm to release the cm_id's
> > > > callback mutex.  This causes the iw_cm workq thread to now unblock and we
> > > > continue on.  (can you say house of cards?)
> > > >
> > > > So the net is:  the cm_id that received the device remove event _must_ be
> > > > disconnected/destroyed _last_.
> > >
> > > Hey Steve, thanks for the detailed description.
> > >
> > > IMHO, this really looks like a buggy design in iWARP connection
> > > management implementation. The fact that a rdma_disconnect forward
> > > progress is dependent on other cm_id execution looks wrong and
> > > backwards to me.
> > >
> >
> > Yea.  I'm not sure how to fix it at this point...
> >
> > > Given that the device is being removed altogether I don't think that
> > > calling rdma_disconnect is important at all. Given the fact that
> > > ib_drain_qp makes sure that the queue-pair is in error state does
> > > this incremental patch resolve the iwarp-deadlock you are seeing?
> > >
> >
> > I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
> > When I implemented them I assumed the QP would be disconnected.  Let me try
> > this and report back.
> 
> > Just doing the drain has the same issue, because moving the QP to ERROR goes
> > through the same process as a disconnect wrt the provider/IWCM interactions.
> 
> > The reason this close provider/iwcm interaction is needed is the iWARP
> > relationship between QPs and the TCP connection (and thus the cm_id).  The
> > iwarp driver stack manipulates the QP state based on the connection events
> > directly. For example, the transition from IDLE -> RTS is done _only_ by the
> > iwarp driver stack.  Similarly, if the peer closes the TCP connection (or
> > aborts), then the local iwarp driver stack will move the QP out of RTS into
> > CLOSING or ERROR.  I think this is different from IB and the IBCM.  This,
> > combined with the IWCM maintaining a reference on the QP until the iwarp
> > driver posts a CLOSED event for its iw_cm_id, is what causes the problem
> > here.  I'm not justifying the existing logic, just trying to explain.
> 
> > Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm
> > to fix this problem...  I guess the main problem is that qp disconnect and
> > destroy basically cannot be done in the event handler.  Granted, it works if
> > you only destroy the qp associated with the cm_id handling the event, but
> > that still causes a deadlock; it's just that the event handler can return
> > and thus unblock the deadlock.

I think I can fix this in iw_cxgb4.  Stay tuned. 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 16:34                     ` Steve Wise
@ 2016-07-18 18:04                       ` Steve Wise
  0 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-18 18:04 UTC (permalink / raw)


> > > > > Hey Sagi, here is some lite reading for you. :)
> > > > >
> > > > > Prelude:  As part of disconnecting an iwarp connection, the iwarp provider
> > > > > needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto
> > > > > the single-threaded iw_cm workqueue.
> > > > >
> > > > > Here is what happens with Sagi's patch:
> > > > >
> > > > > nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls
> > > > > rdma_disconnect().  This triggers the disconnect.  iw_cxgb4 posts the
> > > > > IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the
> > > > > iw_cm workq thread context.  cm_close_handler() calls the rdma_cm event
> > > > > handler for this cm_id, function cm_iw_handler(), which blocks until any
> > > > > currently running event handler for this cm_id finishes.  It does this by
> > > > > calling cm_disable_callback().  However, since this whole unplug process is
> > > > > running in the event handler function for this same cm_id, the iw_cm workq
> > > > > thread is now stuck in a deadlock.  nvme_rdma_device_unplug(), however,
> > > > > continues on and schedules the controller delete worker thread and waits
> > > > > for it to complete.  The delete controller worker thread tries to disconnect
> > > > > and destroy all the remaining IO queues, but gets stuck in the destroy()
> > > > > path on the first IO queue because the iw_cm workq thread is already stuck,
> > > > > and processing the CLOSE event is required to release a reference the iw_cm
> > > > > has on the iwarp provider's qp.  So everything comes to a grinding halt....
> > > > >
> > > > > Now: Ming's 2 patches avoid this deadlock because the cm_id that received
> > > > > the device removal event is disconnected/destroyed _only after_ all the
> > > > > controller queues are disconnected/destroyed.  So nvme_rdma_device_unplug()
> > > > > doesn't get stuck waiting for the controller to delete the io queues; only
> > > > > after that completes does it delete the cm_id/qp that got the device removal
> > > > > event.  It then returns, thus causing the rdma_cm to release the cm_id's
> > > > > callback mutex.  This causes the iw_cm workq thread to now unblock and we
> > > > > continue on.  (can you say house of cards?)
> > > > >
> > > > > So the net is:  the cm_id that received the device remove event _must_ be
> > > > > disconnected/destroyed _last_.
> > > >
> > > > Hey Steve, thanks for the detailed description.
> > > >
> > > > IMHO, this really looks like a buggy design in iWARP connection
> > > > management implementation. The fact that a rdma_disconnect forward
> > > > progress is dependent on other cm_id execution looks wrong and
> > > > backwards to me.
> > > >
> > >
> > > Yea.  I'm not sure how to fix it at this point...
> > >
> > > > Given that the device is being removed altogether I don't think that
> > > > calling rdma_disconnect is important at all. Given the fact that
> > > > ib_drain_qp makes sure that the queue-pair is in error state does
> > > > this incremental patch resolve the iwarp-deadlock you are seeing?
> > > >
> > >
> > > I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
> > > When I implemented them I assumed the QP would be disconnected.  Let me try
> > > this and report back.
> >
> > Just doing the drain has the same issue, because moving the QP to ERROR goes
> > through the same process as a disconnect wrt the provider/IWCM interactions.
> >
> > The reason this close provider/iwcm interaction is needed is the iWARP
> > relationship between QPs and the TCP connection (and thus the cm_id).  The
> > iwarp driver stack manipulates the QP state based on the connection events
> > directly. For example, the transition from IDLE -> RTS is done _only_ by the
> > iwarp driver stack.  Similarly, if the peer closes the TCP connection (or
> > aborts), then the local iwarp driver stack will move the QP out of RTS into
> > CLOSING or ERROR.  I think this is different from IB and the IBCM.  This,
> > combined with the IWCM maintaining a reference on the QP until the iwarp
> > driver posts a CLOSED event for its iw_cm_id, is what causes the problem
> > here.  I'm not justifying the existing logic, just trying to explain.
> >
> > Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm
> > to fix this problem...  I guess the main problem is that qp disconnect and
> > destroy basically cannot be done in the event handler.  Granted, it works if
> > you only destroy the qp associated with the cm_id handling the event, but
> > that still causes a deadlock; it's just that the event handler can return
> > and thus unblock the deadlock.
> 
> I think I can fix this in iw_cxgb4.  Stay tuned.


Nope.  Without somehow refactoring the IWCM, I don't know how to fix this.
I think we should go ahead with Ming's patches, perhaps add a FIXME comment, and
I can look into fixing this in the IWCM.  Unless you have another approach that
only disconnects/flushes/destroys the cm_id/qp that received the device removal
event at the end of the handler upcall...
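
For what it's worth, the kind of FIXME being suggested might look something like
this in the device-removal path (wording is hypothetical):

	/*
	 * FIXME: the cm_id/qp that received the DEVICE_REMOVAL event must be
	 * disconnected/destroyed _last_, otherwise the single-threaded iw_cm
	 * workqueue can deadlock against this event handler (see the iw_cxgb4
	 * discussion on linux-nvme, July 2016).  Revisit once the IWCM can
	 * handle teardown from within the event handler.
	 */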

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-07-18 18:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
2016-07-13 21:33   ` Steve Wise
2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
2016-07-13 21:33   ` Steve Wise
2016-07-13 23:19   ` J Freyensee
2016-07-13 23:36     ` Ming Lin
2016-07-13 23:59       ` J Freyensee
2016-07-14  6:39         ` Ming Lin
2016-07-14 17:09           ` J Freyensee
2016-07-14 18:04             ` Ming Lin
2016-07-14  9:15   ` Sagi Grimberg
2016-07-14  9:17     ` Sagi Grimberg
2016-07-14 14:30       ` Steve Wise
2016-07-14 14:44         ` Sagi Grimberg
2016-07-14 14:59     ` Steve Wise
     [not found]     ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
2016-07-14 15:02       ` Steve Wise
2016-07-14 15:26         ` Steve Wise
2016-07-14 21:27           ` Steve Wise
2016-07-15 15:52             ` Steve Wise
2016-07-17  6:01               ` Sagi Grimberg
2016-07-18 14:55                 ` Steve Wise
2016-07-18 15:47                   ` Steve Wise
2016-07-18 16:34                     ` Steve Wise
2016-07-18 18:04                       ` Steve Wise
2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise
