* [PATCH 0/2] nvme-rdma: device removal crash fixes
@ 2016-07-13 21:26 Ming Lin
2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw)
From: Ming Lin <ming.l@samsung.com>
Here are 2 fixes for the device removal crash.
Steve,
Could you review these and add your tag if they look OK to you?
Thanks again for your great debugging work.
Ming Lin (2):
nvme-rdma: grab reference for device removal event
nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
drivers/nvme/host/rdma.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
--
1.9.1
^ permalink raw reply [flat|nested] 26+ messages in thread* [PATCH 1/2] nvme-rdma: grab reference for device removal event 2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin @ 2016-07-13 21:26 ` Ming Lin 2016-07-13 21:33 ` Steve Wise 2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin 2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise 2 siblings, 1 reply; 26+ messages in thread From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw) From: Ming Lin <ming.l@samsung.com> Below crash was triggered when shutting down a nvme host node via 'reboot' that has 1 target device attached. [ 88.897220] BUG: unable to handle kernel paging request at ffffebe00400f820 [ 88.905226] IP: [<ffffffff811e8d76>] kfree+0x56/0x170 [ 89.182264] Call Trace: [ 89.185899] [<ffffffffc09f7052>] nvme_rdma_free_ring.constprop.42+0x42/0xb0 [nvme_rdma] [ 89.195193] [<ffffffffc09f77ba>] nvme_rdma_destroy_queue_ib+0x3a/0x60 [nvme_rdma] [ 89.203969] [<ffffffffc09f92bc>] nvme_rdma_cm_handler+0x69c/0x8b6 [nvme_rdma] [ 89.212406] [<ffffffff811e859b>] ? __slab_free+0x9b/0x2b0 [ 89.219101] [<ffffffffc0a2c694>] cma_remove_one+0x1f4/0x220 [rdma_cm] [ 89.226838] [<ffffffffc09415b3>] ib_unregister_device+0xc3/0x160 [ib_core] [ 89.235008] [<ffffffffc0a0798a>] mlx4_ib_remove+0x6a/0x220 [mlx4_ib] [ 89.242656] [<ffffffffc097ede7>] mlx4_remove_device+0x97/0xb0 [mlx4_core] [ 89.250732] [<ffffffffc097f48e>] mlx4_unregister_device+0x3e/0xa0 [mlx4_core] [ 89.259151] [<ffffffffc0983a46>] mlx4_unload_one+0x86/0x2f0 [mlx4_core] [ 89.267049] [<ffffffffc0983d97>] mlx4_shutdown+0x57/0x70 [mlx4_core] [ 89.274680] [<ffffffff8141c4b6>] pci_device_shutdown+0x36/0x70 [ 89.281792] [<ffffffff81526c13>] device_shutdown+0xd3/0x180 [ 89.288638] [<ffffffff8109e556>] kernel_restart_prepare+0x36/0x40 [ 89.296003] [<ffffffff8109e602>] kernel_restart+0x12/0x60 [ 89.302688] [<ffffffff8109e983>] SYSC_reboot+0x1f3/0x220 [ 89.309266] [<ffffffff81186048>] ? __filemap_fdatawait_range+0xa8/0x150 [ 89.317151] [<ffffffff8123ec20>] ? fdatawait_one_bdev+0x20/0x20 [ 89.324335] [<ffffffff81188585>] ? __filemap_fdatawrite_range+0xb5/0xf0 [ 89.332227] [<ffffffff8122880a>] ? iput+0x8a/0x200 [ 89.338294] [<ffffffff8123ec00>] ? sync_inodes_one_sb+0x20/0x20 [ 89.345465] [<ffffffff812480d7>] ? iterate_bdevs+0x117/0x130 [ 89.352367] [<ffffffff8109ea0e>] SyS_reboot+0xe/0x10 Debug shows: [31948.771952] MYDEBUG: init kref: nvme_init_ctrl [31948.798589] MYDEBUG: get: nvme_rdma_create_ctrl [31948.803765] MYDEBUG: put: nvmf_dev_release [31948.808734] MYDEBUG: get: nvme_alloc_ns [31948.884775] MYDEBUG: put: nvme_free_ns [31948.890155] MYDEBUG in nvme_rdma_destroy_queue_ib: queue ffff8800cdc81470: io queue [31948.900539] MYDEBUG: put: nvme_rdma_del_ctrl_work [31948.909469] MYDEBUG: nvme_rdma_free_ctrl called [31948.915379] MYDEBUG in nvme_rdma_destroy_queue_ib: queue ffff8800cdc81400: admin queue So nvme_rdma_destroy_queue_ib() was called for admin queue after ctrl was already freed. Fixing it by get/put ctrl reference in nvme_rdma_device_unplug so the ctrl won't be freed before we free the last queue. 
Reported-by: Steve Wise <swise at opengridcomputing.com> Signed-off-by: Ming Lin <ming.l at samsung.com> --- drivers/nvme/host/rdma.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 3e3ce2b..f845304 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1331,6 +1331,12 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue) if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) goto out; + /* + * Grab a reference so the ctrl won't be freed before we free + * the last queue + */ + kref_get(&ctrl->ctrl.kref); + /* delete the controller */ ret = __nvme_rdma_del_ctrl(ctrl); if (!ret) { @@ -1347,6 +1353,8 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue) nvme_rdma_destroy_queue_ib(queue); } + nvme_put_ctrl(&ctrl->ctrl); + out: return ctrl_deleted; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
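The fix above is the usual kref discipline: pin the ctrl with a reference before kicking off a teardown that may drop the last one, and put it only after we stop touching state the ctrl owns. For readers who want the lifetime problem in isolation, here is a minimal userspace sketch of that pattern; all names (my_ctrl, ctrl_get, ctrl_put, delete_ctrl, device_unplug) are illustrative stand-ins, not the driver's symbols.
--
/*
 * Minimal userspace sketch of the pattern in patch 1/2: pin a shared
 * object with a reference before starting a teardown that may drop
 * the last reference, and put it only after we stop touching state
 * owned by that object.  All names here are illustrative.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct my_ctrl {
	atomic_int refs;
	int queue_state;	/* stands in for the admin queue's resources */
};

static void ctrl_get(struct my_ctrl *c)
{
	atomic_fetch_add(&c->refs, 1);
}

static void ctrl_put(struct my_ctrl *c)
{
	if (atomic_fetch_sub(&c->refs, 1) == 1) {
		printf("last reference gone, freeing ctrl\n");
		free(c);
	}
}

/* Stands in for __nvme_rdma_del_ctrl() + flush_work(): drops a reference. */
static void delete_ctrl(struct my_ctrl *c)
{
	ctrl_put(c);
}

static void device_unplug(struct my_ctrl *c)
{
	ctrl_get(c);		/* keep the ctrl alive while we free our queue */
	delete_ctrl(c);		/* may drop what used to be the last reference */
	c->queue_state = 0;	/* still safe: our reference pins the ctrl */
	printf("queue freed while ctrl is still valid (refs=%d)\n",
	       atomic_load(&c->refs));
	ctrl_put(c);		/* now the ctrl may really go away */
}

int main(void)
{
	struct my_ctrl *c = malloc(sizeof(*c));

	atomic_init(&c->refs, 1);	/* initial reference, like nvme_init_ctrl() */
	c->queue_state = 1;
	device_unplug(c);
	return 0;
}
--
Dropping the ctrl_get()/ctrl_put() pair in this sketch reproduces the report: delete_ctrl() frees the object, and the queue cleanup that follows touches freed memory, just like the kfree() oops on the admin queue in the trace above.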
* [PATCH 1/2] nvme-rdma: grab reference for device removal event 2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin @ 2016-07-13 21:33 ` Steve Wise 0 siblings, 0 replies; 26+ messages in thread From: Steve Wise @ 2016-07-13 21:33 UTC (permalink / raw) Looks good. Reviewed-by: Steve Wise <swise at opengridcomputing.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin 2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin @ 2016-07-13 21:26 ` Ming Lin 2016-07-13 21:33 ` Steve Wise ` (2 more replies) 2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise 2 siblings, 3 replies; 26+ messages in thread From: Ming Lin @ 2016-07-13 21:26 UTC (permalink / raw) From: Ming Lin <ming.l@samsung.com> Steve reported below crash when unloading iw_cxgb4. [59079.932154] nvme nvme1: Got rdma device removal event, deleting ctrl [59080.034208] BUG: unable to handle kernel paging request at ffff880f4e6c01f8 [59080.041972] IP: [<ffffffffa02e5a46>] __ib_process_cq+0x46/0xc0 [ib_core] [59080.049422] PGD 22a5067 PUD 10788d8067 PMD 1078864067 PTE 8000000f4e6c0060 [59080.057109] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [59080.164160] CPU: 0 PID: 14879 Comm: kworker/u64:2 Tainted: G E 4.7.0-rc2-block-for-next+ #78 [59080.174704] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015 [59080.182673] Workqueue: iw_cxgb4 process_work [iw_cxgb4] [59080.188924] task: ffff8810278646c0 ti: ffff880ff271c000 task.ti: ffff880ff271c000 [59080.197448] RIP: 0010:[<ffffffffa02e5a46>] [<ffffffffa02e5a46>] __ib_process_cq+0x46/0xc0 [ib_core] [59080.207647] RSP: 0018:ffff881036e03e48 EFLAGS: 00010282 [59080.214000] RAX: 0000000000000010 RBX: ffff8810203f3508 RCX: 0000000000000000 [59080.222194] RDX: ffff880f4e6c01f8 RSI: ffff880f4e6a1fe8 RDI: ffff8810203f3508 [59080.230393] RBP: ffff881036e03e88 R08: 0000000000000000 R09: 000000000000000c [59080.238598] R10: 0000000000000000 R11: 00000000000001f8 R12: 0000000000000020 [59080.246800] R13: 0000000000000100 R14: 0000000000000000 R15: 0000000000000000 [59080.255002] FS: 0000000000000000(0000) GS:ffff881036e00000(0000) knlGS:0000000000000000 [59080.264173] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [59080.271013] CR2: ffff880f4e6c01f8 CR3: 000000102105f000 CR4: 00000000000406f0 [59080.279258] Stack: [59080.282377] 0000000000000000 00000010fcddc1f8 0000000000000246 ffff8810203f3548 [59080.290979] ffff881036e13630 0000000000000100 ffff8810203f3508 ffff881036e03ed8 [59080.299587] ffff881036e03eb8 ffffffffa02e5e12 ffff8810203f3548 ffff881036e13630 [59080.308198] Call Trace: [59080.311779] <IRQ> [59080.313731] [<ffffffffa02e5e12>] ib_poll_handler+0x32/0x80 [ib_core] [59080.322653] [<ffffffff81395695>] irq_poll_softirq+0xa5/0xf0 [59080.329484] [<ffffffff816f186a>] __do_softirq+0xda/0x304 [59080.336047] [<ffffffff816f15b5>] ? do_IRQ+0x65/0xf0 [59080.342193] [<ffffffff816f08fc>] do_softirq_own_stack+0x1c/0x30 [59080.349381] <EOI> [59080.351351] [<ffffffff8109004e>] do_softirq+0x4e/0x50 [59080.359018] [<ffffffff81090127>] __local_bh_enable_ip+0x87/0x90 [59080.366178] [<ffffffffa081b837>] t4_ofld_send+0x127/0x180 [cxgb4] [59080.373499] [<ffffffffa08095ae>] cxgb4_remove_tid+0x9e/0x140 [cxgb4] [59080.381079] [<ffffffffa039235c>] _c4iw_free_ep+0x5c/0x100 [iw_cxgb4] [59080.388665] [<ffffffffa0396812>] peer_close+0x102/0x260 [iw_cxgb4] [59080.396082] [<ffffffffa039629a>] ? process_work+0x5a/0x70 [iw_cxgb4] [59080.403664] [<ffffffffa039629a>] ? process_work+0x5a/0x70 [iw_cxgb4] [59080.411254] [<ffffffff815c42c4>] ? __kfree_skb+0x34/0x80 [59080.417762] [<ffffffff815c4437>] ? kfree_skb+0x47/0xb0 [59080.424084] [<ffffffff815c24e7>] ? 
skb_dequeue+0x67/0x80 [59080.430569] [<ffffffffa039628e>] process_work+0x4e/0x70 [iw_cxgb4] [59080.437940] [<ffffffff810a4d03>] process_one_work+0x183/0x4d0 [59080.444862] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [59080.451373] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [59080.457506] [<ffffffff810a59bd>] worker_thread+0x16d/0x530 [59080.464056] [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0 [59080.470570] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [59080.476985] [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90 [59080.483696] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [59080.490824] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [59080.496808] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [59080.503892] [<ffffffff810aa5dc>] kthread+0xcc/0xf0 [59080.509573] [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0 [59080.515928] [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40 [59080.522093] [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70 Copy Steve's analysis: "I think the problem is nvme_destroy_admin_queue() is called as part of destroying the controller: nvme_rdma_del_ctrl_work() calls nvme_rdma_shutdown_ctrl() which calls nvme_rdma_destroy_admin_queue(). The admin nvme_rdma_queue doesn't get destroyed/freed, though, because the NVME_RDMA_Q_CONNECTED flag has already been cleared by nvme_rdma_device_unplug(). However nvme_destroy_admin_queue() _does_ free the tag set memory, which I believe contains the nvme_rdma_request structs that contain the ib_cqe struct, so when nvme_rdma_device_unplug() does finally flush the QP we crash..." Move the admin queue cleanup to nvme_rdma_free_ctrl() where we can make sure that the RDMA queue was already disconnected and drained and no code will access it any more. Reported-by: Steve Wise <swise at opengridcomputing.com> Signed-off-by: Ming Lin <ming.l at samsung.com> --- drivers/nvme/host/rdma.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index f845304..0d3c227 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -671,9 +671,6 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl) nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe, sizeof(struct nvme_command), DMA_TO_DEVICE); nvme_rdma_stop_and_free_queue(&ctrl->queues[0]); - blk_cleanup_queue(ctrl->ctrl.admin_q); - blk_mq_free_tag_set(&ctrl->admin_tag_set); - nvme_rdma_dev_put(ctrl->device); } static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) list_del(&ctrl->list); mutex_unlock(&nvme_rdma_ctrl_mutex); + blk_cleanup_queue(ctrl->ctrl.admin_q); + blk_mq_free_tag_set(&ctrl->admin_tag_set); + nvme_rdma_dev_put(ctrl->device); + if (ctrl->ctrl.tagset) { blk_cleanup_queue(ctrl->ctrl.connect_q); blk_mq_free_tag_set(&ctrl->tag_set); -- 1.9.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
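Steve's analysis reduces to an ordering rule: memory that the completion path still points into (here the tag set, which holds the nvme_rdma_request structs and their ib_cqe) must outlive the final flush of the queue. A small standalone sketch of that ordering follows; request_ctx, drain_queue and free_tag_set are illustrative names only, not driver symbols.
--
/*
 * Plain-C sketch of the ordering rule behind patch 2/2: memory the
 * completion path still dereferences (the tag set holding the
 * per-request contexts) must not be freed until the queue has been
 * drained.  Names are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>

struct request_ctx {
	int cqe;			/* stands in for struct ib_cqe */
};

struct queue {
	struct request_ctx *tags;	/* memory owned by the tag set */
	int nr_inflight;
};

/* Completion path: still walks q->tags entries, like flushing the QP. */
static void drain_queue(struct queue *q)
{
	while (q->nr_inflight > 0)
		printf("completed request, cqe=%d\n",
		       q->tags[--q->nr_inflight].cqe);
}

static void free_tag_set(struct queue *q)
{
	free(q->tags);
	q->tags = NULL;
}

int main(void)
{
	struct queue q = { .tags = calloc(4, sizeof(*q.tags)), .nr_inflight = 4 };

	/*
	 * The crash corresponds to running free_tag_set() before
	 * drain_queue().  The patch restores the safe order: drain
	 * first, free last, from a point where nothing else can still
	 * reach the queue.
	 */
	drain_queue(&q);
	free_tag_set(&q);
	return 0;
}
--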
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin @ 2016-07-13 21:33 ` Steve Wise 2016-07-13 23:19 ` J Freyensee 2016-07-14 9:15 ` Sagi Grimberg 2 siblings, 0 replies; 26+ messages in thread From: Steve Wise @ 2016-07-13 21:33 UTC (permalink / raw) Looks good. Reviewed-by: Steve Wise <swise at opengridcomputing.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin 2016-07-13 21:33 ` Steve Wise @ 2016-07-13 23:19 ` J Freyensee 2016-07-13 23:36 ` Ming Lin 2016-07-14 9:15 ` Sagi Grimberg 2 siblings, 1 reply; 26+ messages in thread From: J Freyensee @ 2016-07-13 23:19 UTC (permalink / raw) > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl > *nctrl) > list_del(&ctrl->list); > mutex_unlock(&nvme_rdma_ctrl_mutex); > > + blk_cleanup_queue(ctrl->ctrl.admin_q); > + blk_mq_free_tag_set(&ctrl->admin_tag_set); > + nvme_rdma_dev_put(ctrl->device); > + > if (ctrl->ctrl.tagset) { > blk_cleanup_queue(ctrl->ctrl.connect_q); > blk_mq_free_tag_set(&ctrl->tag_set); This patch does not remove the second nvme_rdma_dev_put(ctrl->device); call that happens within the if() statement above if it evaluates to TRUE. Should that have been removed or moved elsewhere? ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 23:19 ` J Freyensee @ 2016-07-13 23:36 ` Ming Lin 2016-07-13 23:59 ` J Freyensee 0 siblings, 1 reply; 26+ messages in thread From: Ming Lin @ 2016-07-13 23:36 UTC (permalink / raw) On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee <james_p_freyensee@linux.intel.com> wrote: > >> static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) >> @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl >> *nctrl) >> list_del(&ctrl->list); >> mutex_unlock(&nvme_rdma_ctrl_mutex); >> >> + blk_cleanup_queue(ctrl->ctrl.admin_q); >> + blk_mq_free_tag_set(&ctrl->admin_tag_set); >> + nvme_rdma_dev_put(ctrl->device); >> + >> if (ctrl->ctrl.tagset) { >> blk_cleanup_queue(ctrl->ctrl.connect_q); >> blk_mq_free_tag_set(&ctrl->tag_set); > > This patch does not remove the second > > nvme_rdma_dev_put(ctrl->device); > > call that happens within the if() statement above if it evaluates to > TRUE. Should that have been removed or moved elsewhere? Not sure I understand your question. Did you mean line 694? For discovery controller, there is no IO queues. So ctrl->ctrl.tagset is NULL. The first bulk of "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for admin queue. And the second is for IO queues. 676 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) 677 { 678 struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl); 679 680 if (list_empty(&ctrl->list)) 681 goto free_ctrl; 682 683 mutex_lock(&nvme_rdma_ctrl_mutex); 684 list_del(&ctrl->list); 685 mutex_unlock(&nvme_rdma_ctrl_mutex); 686 687 blk_cleanup_queue(ctrl->ctrl.admin_q); 688 blk_mq_free_tag_set(&ctrl->admin_tag_set); 689 nvme_rdma_dev_put(ctrl->device); 690 691 if (ctrl->ctrl.tagset) { 692 blk_cleanup_queue(ctrl->ctrl.connect_q); 693 blk_mq_free_tag_set(&ctrl->tag_set); 694 nvme_rdma_dev_put(ctrl->device); 695 } 696 kfree(ctrl->queues); 697 nvmf_free_options(nctrl->opts); 698 free_ctrl: 699 kfree(ctrl); 700 } ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 23:36 ` Ming Lin @ 2016-07-13 23:59 ` J Freyensee 2016-07-14 6:39 ` Ming Lin 0 siblings, 1 reply; 26+ messages in thread From: J Freyensee @ 2016-07-13 23:59 UTC (permalink / raw) On Wed, 2016-07-13@16:36 -0700, Ming Lin wrote: > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee > <james_p_freyensee@linux.intel.com> wrote: > > > > > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct > > > nvme_ctrl > > > *nctrl) > > > list_del(&ctrl->list); > > > mutex_unlock(&nvme_rdma_ctrl_mutex); > > > > > > + blk_cleanup_queue(ctrl->ctrl.admin_q); > > > + blk_mq_free_tag_set(&ctrl->admin_tag_set); > > > + nvme_rdma_dev_put(ctrl->device); > > > + > > > if (ctrl->ctrl.tagset) { > > > blk_cleanup_queue(ctrl->ctrl.connect_q); > > > blk_mq_free_tag_set(&ctrl->tag_set); > > > > This patch does not remove the second > > > > nvme_rdma_dev_put(ctrl->device); > > > > call that happens within the if() statement above if it evaluates > > to > > TRUE. Should that have been removed or moved elsewhere? > > Not sure I understand your question. > Did you mean line 694? Yes I mean line 694. > > For discovery controller, there is no IO queues. So ctrl->ctrl.tagset > is NULL. > > The first bulk of > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for > admin > queue. > And the second is for IO queues. I'm just confused when nvme_free_ctrl() in core.c calls: ctrl->ops->free_ctrl(ctrl); which looks like would be the only call that would free both the admin and I/O rdma queues, why there would be the potential to do a _put() twice in nvme_rdma_free_ctrl() via: nvme_rdma_dev_put(ctrl->device); one for the admin section: 687 blk_cleanup_queue(ctrl>ctrl.admin_q); 688 blk_mq_free_tag_set(&ctrl->admin_tag_set); 689 nvme_rdma_dev_put(ctrl->device); and one for the I/O section (assuming "if (ctrl->ctrl.tagset)" evaluate s to TRUE in that call): 691 if (ctrl->ctrl.tagset) { 692 blk_cleanup_queue(ctrl->ctrl.connect_q); 693 blk_mq_free_tag_set(&ctrl->tag_set); 694 nvme_rdma_dev_put(ctrl->device); 695 } My assumption would be that the correct path for this case would be such that nvme_rdma_dev_put(ctrl->device); would only be called one time, for a single device. ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 23:59 ` J Freyensee @ 2016-07-14 6:39 ` Ming Lin 2016-07-14 17:09 ` J Freyensee 0 siblings, 1 reply; 26+ messages in thread From: Ming Lin @ 2016-07-14 6:39 UTC (permalink / raw) On Wed, 2016-07-13@16:59 -0700, J Freyensee wrote: > On Wed, 2016-07-13@16:36 -0700, Ming Lin wrote: > > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee > > <james_p_freyensee@linux.intel.com> wrote: > > > > > > > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct > > > > nvme_ctrl > > > > *nctrl) > > > > list_del(&ctrl->list); > > > > mutex_unlock(&nvme_rdma_ctrl_mutex); > > > > > > > > + blk_cleanup_queue(ctrl->ctrl.admin_q); > > > > + blk_mq_free_tag_set(&ctrl->admin_tag_set); > > > > + nvme_rdma_dev_put(ctrl->device); > > > > + > > > > if (ctrl->ctrl.tagset) { > > > > blk_cleanup_queue(ctrl->ctrl.connect_q); > > > > blk_mq_free_tag_set(&ctrl->tag_set); > > > > > > This patch does not remove the second > > > > > > nvme_rdma_dev_put(ctrl->device); > > > > > > call that happens within the if() statement above if it evaluates > > > to > > > TRUE. Should that have been removed or moved elsewhere? > > > > Not sure I understand your question. > > Did you mean line 694? > > Yes I mean line 694. > > > > For discovery controller, there is no IO queues. So ctrl->ctrl.tagset > > is NULL. > > > > The first bulk of > > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for > > admin > > queue. > > And the second is for IO queues. > > I'm just confused when nvme_free_ctrl() in core.c calls: > > ctrl->ops->free_ctrl(ctrl); > > which looks like would be the only call that would free both the > admin > and I/O rdma queues, why there would be the potential to do a _put() > twice in nvme_rdma_free_ctrl() via: > > nvme_rdma_dev_put(ctrl->device); > > one for the admin section: > > 687 blk_cleanup_queue(ctrl->ctrl.admin_q); > 688 blk_mq_free_tag_set(&ctrl->admin_tag_set); > 689 nvme_rdma_dev_put(ctrl->device); This put paired with the get in nvme_rdma_configure_admin_queue() > > and one for the I/O section (assuming "if (ctrl->ctrl.tagset)" > evaluate > s to TRUE in that call): > > 691 if (ctrl->ctrl.tagset) { > 692 blk_cleanup_queue(ctrl->ctrl.connect_q); > 693 blk_mq_free_tag_set(&ctrl->tag_set); > 694 nvme_rdma_dev_put(ctrl->device); This put paired with the get in nvme_rdma_create_io_queues() > 695 } > > > My assumption would be that the correct path for this case would be > such that > > nvme_rdma_dev_put(ctrl->device); > > would only be called one time, for a single device. As above, need put for each get. ^ permalink raw reply [flat|nested] 26+ messages in thread
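Ming's point is the standard pairing rule: each user of the device (the admin queue and the IO queues) takes its own reference, so the two puts in nvme_rdma_free_ctrl() balance two earlier gets rather than releasing the same reference twice. A toy bookkeeping sketch, with purely illustrative names and flow:
--
/*
 * Toy sketch of the point above: the admin queue and the IO queues
 * each take their own device reference, so teardown owes one put per
 * get -- two puts in free_ctrl is not a double release.  Names and
 * flow are illustrative only.
 */
#include <stdio.h>

static int dev_refs;

static void dev_get(const char *who)
{
	printf("get by %-28s -> refs=%d\n", who, ++dev_refs);
}

static void dev_put(const char *who)
{
	printf("put by %-28s -> refs=%d\n", who, --dev_refs);
}

int main(void)
{
	int have_io_queues = 1;		/* 0 for a discovery controller */

	dev_get("configure_admin_queue");
	if (have_io_queues)
		dev_get("create_io_queues");

	/* free_ctrl: one put for the admin queue ... */
	dev_put("free_ctrl (admin)");
	/* ... and one for the IO queues, but only if they were created. */
	if (have_io_queues)
		dev_put("free_ctrl (io)");

	return dev_refs;		/* 0 when every get has its put */
}
--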
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 6:39 ` Ming Lin @ 2016-07-14 17:09 ` J Freyensee 2016-07-14 18:04 ` Ming Lin 0 siblings, 1 reply; 26+ messages in thread From: J Freyensee @ 2016-07-14 17:09 UTC (permalink / raw) On Wed, 2016-07-13@23:39 -0700, Ming Lin wrote: > On Wed, 2016-07-13@16:59 -0700, J Freyensee wrote: > > On Wed, 2016-07-13@16:36 -0700, Ming Lin wrote: > > > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee > > > <james_p_freyensee@linux.intel.com> wrote: > > > > > > > > > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > > > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct > > > > > nvme_ctrl > > > > > *nctrl) > > > > > list_del(&ctrl->list); > > > > > mutex_unlock(&nvme_rdma_ctrl_mutex); > > > > > > > > > > + blk_cleanup_queue(ctrl->ctrl.admin_q); > > > > > + blk_mq_free_tag_set(&ctrl->admin_tag_set); > > > > > + nvme_rdma_dev_put(ctrl->device); > > > > > + > > > > > if (ctrl->ctrl.tagset) { > > > > > blk_cleanup_queue(ctrl->ctrl.connect_q); > > > > > blk_mq_free_tag_set(&ctrl->tag_set); > > > > > > > > This patch does not remove the second > > > > > > > > nvme_rdma_dev_put(ctrl->device); > > > > > > > > call that happens within the if() statement above if it > > > > evaluates > > > > to > > > > TRUE. Should that have been removed or moved elsewhere? > > > > > > Not sure I understand your question. > > > Did you mean line 694? > > > > Yes I mean line 694. > > > > > > > > For discovery controller, there is no IO queues. So ctrl- > > > > ctrl.tagset > > > is NULL. > > > > > > The first bulk of > > > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for > > > admin > > > queue. > > > And the second is for IO queues. > > > > I'm just confused when nvme_free_ctrl() in core.c calls: > > > > ctrl->ops->free_ctrl(ctrl); > > > > which looks like would be the only call that would free both the > > admin > > and I/O rdma queues, why there would be the potential to do a > > _put() > > twice in nvme_rdma_free_ctrl() via: > > > > nvme_rdma_dev_put(ctrl->device); > > > > one for the admin section: > > > > 687 blk_cleanup_queue(ctrl>ctrl.admin_q); > > 688 blk_mq_free_tag_set(&ctrl->admin_tag_set); > > 689 nvme_rdma_dev_put(ctrl->device); > > This put paired with the get in nvme_rdma_configure_admin_queue() > > > > > and one for the I/O section (assuming "if (ctrl->ctrl.tagset)" > > evaluate > > s to TRUE in that call): > > > > 691 if (ctrl->ctrl.tagset) { > > 692 blk_cleanup_queue(ctrl->ctrl.connect_q); > > 693 blk_mq_free_tag_set(&ctrl->tag_set); > > 694 nvme_rdma_dev_put(ctrl->device); > > This put paired with the get in nvme_rdma_create_io_queues() > > > 695 } > > > > > > My assumption would be that the correct path for this case would be > > such that > > > > nvme_rdma_dev_put(ctrl->device); > > > > would only be called one time, for a single device. > > As above, need put for each get. OK, thanks for the clarification on this. Would it then be worth to have an if() check around the new admin logic, as the assumption with the added code is it will always need to free the admin_queue when called, something like: if (ctrl->admin_tag_set) { blk_cleanup_queue(ctrl->ctrl.admin_q); blk_mq_free_tag_set(&ctrl->admin_tag_set); nvme_rdma_dev_put(ctrl->device); } blk_mq_free_tag_set() does assume its parameter is not NULL. 
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 17:09 ` J Freyensee @ 2016-07-14 18:04 ` Ming Lin 0 siblings, 0 replies; 26+ messages in thread From: Ming Lin @ 2016-07-14 18:04 UTC (permalink / raw) On Thu, Jul 14, 2016 at 10:09 AM, J Freyensee <james_p_freyensee@linux.intel.com> wrote: > On Wed, 2016-07-13@23:39 -0700, Ming Lin wrote: >> On Wed, 2016-07-13@16:59 -0700, J Freyensee wrote: >> > On Wed, 2016-07-13@16:36 -0700, Ming Lin wrote: >> > > On Wed, Jul 13, 2016 at 4:19 PM, J Freyensee >> > > <james_p_freyensee@linux.intel.com> wrote: >> > > > >> > > > > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) >> > > > > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct >> > > > > nvme_ctrl >> > > > > *nctrl) >> > > > > list_del(&ctrl->list); >> > > > > mutex_unlock(&nvme_rdma_ctrl_mutex); >> > > > > >> > > > > + blk_cleanup_queue(ctrl->ctrl.admin_q); >> > > > > + blk_mq_free_tag_set(&ctrl->admin_tag_set); >> > > > > + nvme_rdma_dev_put(ctrl->device); >> > > > > + >> > > > > if (ctrl->ctrl.tagset) { >> > > > > blk_cleanup_queue(ctrl->ctrl.connect_q); >> > > > > blk_mq_free_tag_set(&ctrl->tag_set); >> > > > >> > > > This patch does not remove the second >> > > > >> > > > nvme_rdma_dev_put(ctrl->device); >> > > > >> > > > call that happens within the if() statement above if it >> > > > evaluates >> > > > to >> > > > TRUE. Should that have been removed or moved elsewhere? >> > > >> > > Not sure I understand your question. >> > > Did you mean line 694? >> > >> > Yes I mean line 694. >> > >> > > >> > > For discovery controller, there is no IO queues. So ctrl- >> > > > ctrl.tagset >> > > is NULL. >> > > >> > > The first bulk of >> > > "blk_cleanup_queue/blk_mq_free_tag_set/nvme_rdma_dev_put" is for >> > > admin >> > > queue. >> > > And the second is for IO queues. >> > >> > I'm just confused when nvme_free_ctrl() in core.c calls: >> > >> > ctrl->ops->free_ctrl(ctrl); >> > >> > which looks like would be the only call that would free both the >> > admin >> > and I/O rdma queues, why there would be the potential to do a >> > _put() >> > twice in nvme_rdma_free_ctrl() via: >> > >> > nvme_rdma_dev_put(ctrl->device); >> > >> > one for the admin section: >> > >> > 687 blk_cleanup_queue(ctrl>ctrl.admin_q); >> > 688 blk_mq_free_tag_set(&ctrl->admin_tag_set); >> > 689 nvme_rdma_dev_put(ctrl->device); >> >> This put paired with the get in nvme_rdma_configure_admin_queue() >> >> > >> > and one for the I/O section (assuming "if (ctrl->ctrl.tagset)" >> > evaluate >> > s to TRUE in that call): >> > >> > 691 if (ctrl->ctrl.tagset) { >> > 692 blk_cleanup_queue(ctrl->ctrl.connect_q); >> > 693 blk_mq_free_tag_set(&ctrl->tag_set); >> > 694 nvme_rdma_dev_put(ctrl->device); >> >> This put paired with the get in nvme_rdma_create_io_queues() >> >> > 695 } >> > >> > >> > My assumption would be that the correct path for this case would be >> > such that >> > >> > nvme_rdma_dev_put(ctrl->device); >> > >> > would only be called one time, for a single device. >> >> As above, need put for each get. > > OK, thanks for the clarification on this. > > Would it then be worth to have an if() check around the new admin > logic, as the assumption with the added code is it will always need to > free the admin_queue when called, something like: No, because admin_tag_set is always not NULL. 
> > if (ctrl->admin_tag_set) { > blk_cleanup_queue(ctrl->ctrl.admin_q); > blk_mq_free_tag_set(&ctrl->admin_tag_set); > nvme_rdma_dev_put(ctrl->device); > } > > blk_mq_free_tag_set() does assume its parameter is not NULL. Sagi's patch is better than these two. We'll go for it. ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin 2016-07-13 21:33 ` Steve Wise 2016-07-13 23:19 ` J Freyensee @ 2016-07-14 9:15 ` Sagi Grimberg 2016-07-14 9:17 ` Sagi Grimberg ` (2 more replies) 2 siblings, 3 replies; 26+ messages in thread From: Sagi Grimberg @ 2016-07-14 9:15 UTC (permalink / raw) Hey Ming, Steve, > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c > index f845304..0d3c227 100644 > --- a/drivers/nvme/host/rdma.c > +++ b/drivers/nvme/host/rdma.c > @@ -671,9 +671,6 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl) > nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe, > sizeof(struct nvme_command), DMA_TO_DEVICE); > nvme_rdma_stop_and_free_queue(&ctrl->queues[0]); > - blk_cleanup_queue(ctrl->ctrl.admin_q); > - blk_mq_free_tag_set(&ctrl->admin_tag_set); > - nvme_rdma_dev_put(ctrl->device); > } > > static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > @@ -687,6 +684,10 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl) > list_del(&ctrl->list); > mutex_unlock(&nvme_rdma_ctrl_mutex); > > + blk_cleanup_queue(ctrl->ctrl.admin_q); > + blk_mq_free_tag_set(&ctrl->admin_tag_set); > + nvme_rdma_dev_put(ctrl->device); > + > if (ctrl->ctrl.tagset) { > blk_cleanup_queue(ctrl->ctrl.connect_q); > blk_mq_free_tag_set(&ctrl->tag_set); > This patch introduces asymmetry between create and destroy of the admin queue. Does this alternative patch solve the problem? The patch changes the order of device removal flow from: 1. delete controller 2. destroy queue to: 1. destroy queue 2. delete controller Or more specifically: 1. own the controller deletion (make sure we are not competing with anyone) 2. get rid of inflight reconnects (which also destroy and create queues) 3. destroy the queue 4. safely queue controller deletion Thoughts? -- diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 21ecbf3f3603..36cb4e33175a 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -169,7 +169,6 @@ MODULE_PARM_DESC(register_always, static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event); static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc); -static int __nvme_rdma_del_ctrl(struct nvme_rdma_ctrl *ctrl); /* XXX: really should move to a generic header sooner or later.. 
*/ static inline void put_unaligned_le24(u32 val, u8 *p) @@ -1325,30 +1324,36 @@ out_destroy_queue_ib: static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue) { struct nvme_rdma_ctrl *ctrl = queue->ctrl; - int ret, ctrl_deleted = 0; + int ret; - /* First disable the queue so ctrl delete won't free it */ - if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) - goto out; + /* Own the controller deletion */ + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING)) + return 0; - /* delete the controller */ - ret = __nvme_rdma_del_ctrl(ctrl); - if (!ret) { - dev_warn(ctrl->ctrl.device, - "Got rdma device removal event, deleting ctrl\n"); - flush_work(&ctrl->delete_work); + dev_warn(ctrl->ctrl.device, + "Got rdma device removal event, deleting ctrl\n"); - /* Return non-zero so the cm_id will destroy implicitly */ - ctrl_deleted = 1; + /* Get rid of reconnect work if its running */ + cancel_delayed_work_sync(&ctrl->reconnect_work); - /* Free this queue ourselves */ - rdma_disconnect(queue->cm_id); - ib_drain_qp(queue->qp); - nvme_rdma_destroy_queue_ib(queue); + /* Disable the queue so ctrl delete won't free it */ + if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) { + ret = 0; + goto queue_delete; } -out: - return ctrl_deleted; + /* Free this queue ourselves */ + nvme_rdma_stop_queue(queue); + nvme_rdma_destroy_queue_ib(queue); + + /* Return non-zero so the cm_id will destroy implicitly */ + ret = 1; + +queue_delete: + /* queue controller deletion */ + queue_work(nvme_rdma_wq, &ctrl->delete_work); + flush_work(&ctrl->delete_work); + return ret; } static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, -- ^ permalink raw reply related [flat|nested] 26+ messages in thread
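The first step of Sagi's flow, "own the controller deletion", is a single atomic state transition (nvme_change_ctrl_state to NVME_CTRL_DELETING) that only one context can win; everyone else backs off. A compact userspace sketch of that idea under a three-state controller model follows; the state names, claim_deletion() and device_unplug() here are illustrative, not the nvme core's real state machine.
--
/*
 * Sketch of the "own the controller deletion" step: a single
 * compare-and-swap decides which context performs the teardown, so a
 * device-removal event cannot race with a normal delete.  States and
 * helpers are illustrative.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { CTRL_LIVE, CTRL_DELETING, CTRL_DELETED };

struct ctrl {
	_Atomic int state;
};

/* Only one caller can win the LIVE -> DELETING transition. */
static bool claim_deletion(struct ctrl *c)
{
	int expected = CTRL_LIVE;

	return atomic_compare_exchange_strong(&c->state, &expected,
					      CTRL_DELETING);
}

static void device_unplug(struct ctrl *c, const char *who)
{
	if (!claim_deletion(c)) {
		printf("%s: deletion already owned elsewhere, backing off\n",
		       who);
		return;
	}
	printf("%s: owns the deletion\n", who);
	/* ... cancel reconnects, destroy the queue, queue delete work ... */
	atomic_store(&c->state, CTRL_DELETED);
}

int main(void)
{
	struct ctrl c;

	atomic_init(&c.state, CTRL_LIVE);
	device_unplug(&c, "device removal event");
	device_unplug(&c, "user-requested delete");	/* loses the race */
	return 0;
}
--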
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 9:15 ` Sagi Grimberg @ 2016-07-14 9:17 ` Sagi Grimberg 2016-07-14 14:30 ` Steve Wise 2016-07-14 14:59 ` Steve Wise [not found] ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com> 2 siblings, 1 reply; 26+ messages in thread From: Sagi Grimberg @ 2016-07-14 9:17 UTC (permalink / raw) > This patch introduces asymmetry between create and destroy > of the admin queue. Does this alternative patch solve > the problem? > > The patch changes the order of device removal flow from: > 1. delete controller > 2. destroy queue > > to: > 1. destroy queue > 2. delete controller > > Or more specifically: > 1. own the controller deletion (make sure we are not > competing with anyone) > 2. get rid of inflight reconnects (which also destroy and > create queues) > 3. destroy the queue > 4. safely queue controller deletion > > Thoughts? Should mention that patch 1 is not needed as well with this... ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 9:17 ` Sagi Grimberg @ 2016-07-14 14:30 ` Steve Wise 2016-07-14 14:44 ` Sagi Grimberg 0 siblings, 1 reply; 26+ messages in thread From: Steve Wise @ 2016-07-14 14:30 UTC (permalink / raw) > -----Original Message----- > From: Sagi Grimberg [mailto:sagi at grimberg.me] > Sent: Thursday, July 14, 2016 4:18 AM > To: Ming Lin; linux-nvme at lists.infradead.org > Cc: Christoph Hellwig; Steve Wise; Jens Axboe > Subject: Re: [PATCH 2/2] nvme-rdma: move admin queue cleanup to > nvme_rdma_free_ctrl > > > > This patch introduces asymmetry between create and destroy > > of the admin queue. Does this alternative patch solve > > the problem? > > > > The patch changes the order of device removal flow from: > > 1. delete controller > > 2. destroy queue > > > > to: > > 1. destroy queue > > 2. delete controller > > > > Or more specifically: > > 1. own the controller deletion (make sure we are not > > competing with anyone) > > 2. get rid of inflight reconnects (which also destroy and > > create queues) > > 3. destroy the queue > > 4. safely queue controller deletion > > > > Thoughts? > > Should mention that patch 1 is not needed as well with this... Hey Sagi, This patch won't apply. I think it is munged in the email. Can you please email me the patch as an attachment? Thanks, Steve. ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 14:30 ` Steve Wise @ 2016-07-14 14:44 ` Sagi Grimberg 0 siblings, 0 replies; 26+ messages in thread From: Sagi Grimberg @ 2016-07-14 14:44 UTC (permalink / raw) > Hey Sagi, This patch won't apply. I think it is munged in the email. Can you > please email me the patch as an attachment? Attached -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-nvme-rdma-Fix-device-removal-handling.patch Type: text/x-patch Size: 3828 bytes Desc: not available URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160714/1ac4aaaa/attachment.bin> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 9:15 ` Sagi Grimberg 2016-07-14 9:17 ` Sagi Grimberg @ 2016-07-14 14:59 ` Steve Wise [not found] ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com> 2 siblings, 0 replies; 26+ messages in thread From: Steve Wise @ 2016-07-14 14:59 UTC (permalink / raw) > This patch introduces asymmetry between create and destroy > of the admin queue. Does this alternative patch solve > the problem? > > The patch changes the order of device removal flow from: > 1. delete controller > 2. destroy queue > > to: > 1. destroy queue > 2. delete controller > > Or more specifically: > 1. own the controller deletion (make sure we are not > competing with anyone) > 2. get rid of inflight reconnects (which also destroy and > create queues) > 3. destroy the queue > 4. safely queue controller deletion > > Thoughts? > Your patch causes a deadlock during device removal. The controller delete work thread is stuck in c4iw_destroy_qp waiting on all references to go away. Either nvmf/rdma or the rdma-cm or both. (gdb) list *c4iw_destroy_qp+0x155 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596). 1591 c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, &attrs, 0); 1592 wait_event(qhp->wait, !qhp->ep); 1593 1594 remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid); 1595 atomic_dec(&qhp->refcnt); 1596 wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); 1597 1598 spin_lock_irq(&rhp->lock); 1599 if (!list_empty(&qhp->db_fc_entry)) 1600 list_del_init(&qhp->db_fc_entry); The rdma-cm work thread is stuck trying to grab the cm_id mutex: (gdb) list *cma_disable_callback+0x2e 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715). 710 711 static int cma_disable_callback(struct rdma_id_private *id_priv, 712 enum rdma_cm_state state) 713 { 714 mutex_lock(&id_priv->handler_mutex); 715 if (id_priv->state != state) { 716 mutex_unlock(&id_priv->handler_mutex); 717 return -EINVAL; 718 } 719 return 0; And the nvmf cm event handler is stuck waiting for the controller delete to finish: (gdb) list *nvme_rdma_device_unplug+0x97 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358). warning: Source file is more recent than executable. 1353 queue_delete: 1354 /* queue controller deletion */ 1355 queue_work(nvme_rdma_wq, &ctrl->delete_work); 1356 flush_work(&ctrl->delete_work); 1357 return ret; 1358 } 1359 1360 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, 1361 struct rdma_cm_event *ev) 1362 { Seems like the rdma-cm work thread is trying to grab the cm_id lock for the cm_id that is handling the DEVICE_REMOVAL event. Here is the detailed log: [ 1484.568354] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.1.14:4420 [ 1495.939850] nvme nvme1: creating 32 I/O queues. [ 1496.635114] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420 [ 1502.813698] nvme nvme1: Got rdma device removal event, deleting ctrl [ 1683.259465] INFO: task kworker/0:0:4 blocked for more than 120 seconds. [ 1683.266147] Tainted: G E 4.7.0-rc2-block-for-next+ #78 [ 1683.272986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 1683.280830] kworker/0:0 D ffff8810281579c8 0 4 2 0x10000000 [ 1683.287997] Workqueue: nvme_rdma_wq nvme_rdma_del_ctrl_work [nvme_rdma] [ 1683.294658] ffff8810281579c8 ffffffff81c0d4c0 ffff88102809c6c0 ffff880fff04e898 [ 1683.302252] ffffffff81d448b4 ffff88102809d840 ffff881028154008 ffff88007853b3f8 [ 1683.309763] ffff881028157ac8 ffff880f67ea5758 ffff880f67ea5760 ffff881028157a18 [ 1683.317281] Call Trace: [ 1683.319752] [<ffffffff816eaed0>] schedule+0x40/0xb0 [ 1683.324757] [<ffffffffa036bac5>] c4iw_destroy_qp+0x155/0x2b0 [iw_cxgb4] [ 1683.331481] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [ 1683.336639] [<ffffffff810f515a>] ? lock_timer_base+0x5a/0x80 [ 1683.342402] [<ffffffff816ede49>] ? schedule_timeout+0x1c9/0x220 [ 1683.348438] [<ffffffff810ccd80>] ? wait_woken+0x80/0x80 [ 1683.353804] [<ffffffffa03a25a9>] ? ib_mr_pool_destroy+0x79/0x90 [ib_core] [ 1683.360699] [<ffffffffa039893e>] ib_destroy_qp+0xbe/0x160 [ib_core] [ 1683.367085] [<ffffffffa0706011>] rdma_destroy_qp+0x31/0x50 [rdma_cm] [ 1683.373538] [<ffffffffa038ed99>] nvme_rdma_destroy_queue_ib+0x29/0x90 [nvme_rdma] [ 1683.381127] [<ffffffffa038f04d>] nvme_rdma_stop_and_free_queue+0x2d/0x40 [nvme_rdma] [ 1683.388967] [<ffffffffa038f157>] nvme_rdma_shutdown_ctrl+0x87/0xe0 [nvme_rdma] [ 1683.396290] [<ffffffffa038f224>] nvme_rdma_del_ctrl_work+0x34/0x50 [nvme_rdma] [ 1683.403629] [<ffffffff810a4d03>] process_one_work+0x183/0x4d0 [ 1683.409485] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [ 1683.414984] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [ 1683.420142] [<ffffffff810c4130>] ? dequeue_entity+0x120/0x250 [ 1683.426009] [<ffffffff810a59bd>] worker_thread+0x16d/0x530 [ 1683.431597] [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0 [ 1683.437176] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [ 1683.442674] [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90 [ 1683.448511] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [ 1683.454803] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [ 1683.459951] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [ 1683.466223] [<ffffffff810aa5dc>] kthread+0xcc/0xf0 [ 1683.471106] [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0 [ 1683.476699] [<ffffffff810ae3b0>] ? smpboot_create_threads+0x80/0x80 [ 1683.483053] [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40 [ 1683.488470] [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70 [ 1683.495476] INFO: task kworker/u64:5:716 blocked for more than 120 seconds. [ 1683.502441] Tainted: G E 4.7.0-rc2-block-for-next+ #78 [ 1683.509237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1683.517096] kworker/u64:5 D ffff88101d1ff858 0 716 2 0x10000000 [ 1683.524223] Workqueue: iw_cm_wq cm_work_handler [iw_cm] [ 1683.529467] ffff88101d1ff858 ffff88102819c6c0 ffff88101d200040 ffff880fff04e898 [ 1683.536954] ffff88101d200040 ffff88101d200040 ffff88101d1fc008 ffff880f67a0720c [ 1683.544445] ffff88101d200040 ffff880f67a07210 ffff88101d1ff918 ffff88101d1ff8a8 [ 1683.551932] Call Trace: [ 1683.554390] [<ffffffff816eaed0>] schedule+0x40/0xb0 [ 1683.559358] [<ffffffff811119ae>] ? is_module_text_address+0xe/0x20 [ 1683.565630] [<ffffffff816eb12e>] schedule_preempt_disabled+0xe/0x10 [ 1683.571990] [<ffffffff816ecbe3>] __mutex_lock_slowpath+0x93/0x120 [ 1683.578200] [<ffffffff813789bb>] ? find_next_bit+0xb/0x10 [ 1683.583694] [<ffffffff81362816>] ? cpumask_next_and+0x26/0x50 [ 1683.589529] [<ffffffff810c4baa>] ? 
find_busiest_group+0x14a/0xa00 [ 1683.595713] [<ffffffff816ecc93>] mutex_lock+0x23/0x40 [ 1683.600871] [<ffffffffa070326e>] cma_disable_callback+0x2e/0x60 [rdma_cm] [ 1683.607760] [<ffffffffa0707b5b>] cma_iw_handler+0x2b/0x1b0 [rdma_cm] [ 1683.614207] [<ffffffffa06f61b3>] cm_close_handler+0x93/0xc0 [iw_cm] [ 1683.620561] [<ffffffffa06f80e7>] process_event+0xd7/0xf0 [iw_cm] [ 1683.626665] [<ffffffffa06f8247>] cm_work_handler+0x147/0x1c0 [iw_cm] [ 1683.633121] [<ffffffff810a54f6>] ? trace_event_raw_event_workqueue_execute_start+0x66/0xa0 [ 1683.641487] [<ffffffff810a4d03>] process_one_work+0x183/0x4d0 [ 1683.647334] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [ 1683.652831] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [ 1683.657977] [<ffffffff816eae91>] ? schedule+0x1/0xb0 [ 1683.663038] [<ffffffff810a59bd>] worker_thread+0x16d/0x530 [ 1683.668612] [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0 [ 1683.674197] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [ 1683.679690] [<ffffffff810ccbc6>] ? __wake_up_common+0x56/0x90 [ 1683.685534] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [ 1683.691821] [<ffffffff816eaed0>] ? schedule+0x40/0xb0 [ 1683.696968] [<ffffffff810a5850>] ? maybe_create_worker+0x120/0x120 [ 1683.703230] [<ffffffff810aa5dc>] kthread+0xcc/0xf0 [ 1683.708126] [<ffffffff810b4ffe>] ? schedule_tail+0x1e/0xc0 [ 1683.713708] [<ffffffff816eed3f>] ret_from_fork+0x1f/0x40 [ 1683.719118] [<ffffffff810aa510>] ? kthread_freezable_should_stop+0x70/0x70 [ 1683.726126] INFO: task rmmod:6006 blocked for more than 120 seconds. [ 1683.732483] Tainted: G E 4.7.0-rc2-block-for-next+ #78 [ 1683.739278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1683.747111] rmmod D ffff880f647479f8 0 6006 4227 0x10000080 [ 1683.754210] ffff880f647479f8 ffff88102819af40 ffff880f647017c0 ffff880fff04e898 [ 1683.761729] ffffffff8116c656 0000000000000001 ffff880f64744008 ffff880f64747b98 [ 1683.769244] 7fffffffffffffff ffff880f647017c0 7fffffffffffffff ffff880f64747a48 [ 1683.776768] Call Trace: [ 1683.779229] [<ffffffff8116c656>] ? trace_event_buffer_commit+0x146/0x1d0 [ 1683.786037] [<ffffffff816eaed0>] schedule+0x40/0xb0 [ 1683.791009] [<ffffffff8102eb1d>] ? __switch_to+0x1cd/0x5e0 [ 1683.796585] [<ffffffff816ede49>] schedule_timeout+0x1c9/0x220 [ 1683.802430] [<ffffffff816eaa10>] ? __schedule+0x1f0/0x5b0 [ 1683.807922] [<ffffffff810a3603>] ? insert_work+0x53/0xb0 [ 1683.813341] [<ffffffff816ebede>] wait_for_completion+0xde/0x110 [ 1683.819386] [<ffffffff810b6b00>] ? try_to_wake_up+0x230/0x230 [ 1683.825236] [<ffffffff810a472b>] flush_work+0x2b/0x40 [ 1683.830379] [<ffffffff810a1810>] ? worker_detach_from_pool+0xa0/0xa0 [ 1683.836830] [<ffffffffa038eff7>] nvme_rdma_device_unplug+0x97/0xc0 [nvme_rdma] [ 1683.844142] [<ffffffffa0390782>] nvme_rdma_cm_handler+0x72/0x2f0 [nvme_rdma] [ 1683.851297] [<ffffffff812085c4>] ? kmem_cache_free+0x1f4/0x210 [ 1683.857230] [<ffffffffa0703669>] ? cma_comp+0x49/0x60 [rdma_cm] [ 1683.863247] [<ffffffffa070810f>] cma_remove_id_dev+0x8f/0xa0 [rdma_cm] [ 1683.869865] [<ffffffffa07081d7>] cma_process_remove+0xb7/0x100 [rdma_cm] [ 1683.876681] [<ffffffff812b2a14>] ? 
__kernfs_remove+0x114/0x1d0 [ 1683.882608] [<ffffffffa070825e>] cma_remove_one+0x3e/0x60 [rdma_cm] [ 1683.888983] [<ffffffffa039dca0>] ib_unregister_device+0xb0/0x150 [ib_core] [ 1683.895957] [<ffffffffa0363034>] c4iw_unregister_device+0x64/0x90 [iw_cxgb4] [ 1683.903109] [<ffffffffa0356357>] c4iw_remove+0x27/0x60 [iw_cxgb4] [ 1683.909303] [<ffffffffa036e6fd>] c4iw_exit_module+0x31/0x934 [iw_cxgb4] [ 1683.916018] [<ffffffff811120e3>] SyS_delete_module+0x183/0x1d0 [ 1683.921955] [<ffffffff81003476>] ? do_audit_syscall_entry+0x66/0x70 [ 1683.928314] [<ffffffff8100378c>] ? trace_event_raw_event_sys_exit+0x6c/0xa0 [ 1683.935357] [<ffffffff81003b45>] ? syscall_trace_enter+0x65/0x70 [ 1683.941501] [<ffffffff81003f08>] do_syscall_64+0x78/0x1d0 [ 1683.947003] [<ffffffff8106e367>] ? do_page_fault+0x37/0x90 [ 1683.952587] [<ffffffff816eebe1>] entry_SYSCALL64_slow_path+0x25/0x25 ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>]
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl [not found] ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com> @ 2016-07-14 15:02 ` Steve Wise 2016-07-14 15:26 ` Steve Wise 0 siblings, 1 reply; 26+ messages in thread From: Steve Wise @ 2016-07-14 15:02 UTC (permalink / raw) > > This patch introduces asymmetry between create and destroy > > of the admin queue. Does this alternative patch solve > > the problem? > > > > The patch changes the order of device removal flow from: > > 1. delete controller > > 2. destroy queue > > > > to: > > 1. destroy queue > > 2. delete controller > > > > Or more specifically: > > 1. own the controller deletion (make sure we are not > > competing with anyone) > > 2. get rid of inflight reconnects (which also destroy and > > create queues) > > 3. destroy the queue > > 4. safely queue controller deletion > > > > Thoughts? > > > > Your patch causes a deadlock during device removal. > > The controller delete work thread is stuck in c4iw_destroy_qp waiting on > all references to go away. Either nvmf/rdma or the rdma-cm or both. > > (gdb) list *c4iw_destroy_qp+0x155 > 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596). > 1591 c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, > &attrs, 0); > 1592 wait_event(qhp->wait, !qhp->ep); > 1593 > 1594 remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid); > 1595 atomic_dec(&qhp->refcnt); > 1596 wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); > 1597 > 1598 spin_lock_irq(&rhp->lock); > 1599 if (!list_empty(&qhp->db_fc_entry)) > 1600 list_del_init(&qhp->db_fc_entry); > > The rdma-cm work thread is stuck trying to grab the cm_id mutex: > > (gdb) list *cma_disable_callback+0x2e > 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715). > 710 > 711 static int cma_disable_callback(struct rdma_id_private *id_priv, > 712 enum rdma_cm_state state) > 713 { > 714 mutex_lock(&id_priv->handler_mutex); > 715 if (id_priv->state != state) { > 716 mutex_unlock(&id_priv->handler_mutex); > 717 return -EINVAL; > 718 } > 719 return 0; > > And the nvmf cm event handler is stuck waiting for the controller delete > to finish: > > (gdb) list *nvme_rdma_device_unplug+0x97 > 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358). > warning: Source file is more recent than executable. > 1353 queue_delete: > 1354 /* queue controller deletion */ > 1355 queue_work(nvme_rdma_wq, &ctrl->delete_work); > 1356 flush_work(&ctrl->delete_work); > 1357 return ret; > 1358 } > 1359 > 1360 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, > 1361 struct rdma_cm_event *ev) > 1362 { > > Seems like the rdma-cm work thread is trying to grab the cm_id lock for > the cm_id that is handling the DEVICE_REMOVAL event. > And, the nvmf/rdma delete controller work thread is trying to delete the cm_id that received the DEVICE_REMOVAL event, which is the crux o' the biscuit, methinks... ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 15:02 ` Steve Wise @ 2016-07-14 15:26 ` Steve Wise 2016-07-14 21:27 ` Steve Wise 0 siblings, 1 reply; 26+ messages in thread From: Steve Wise @ 2016-07-14 15:26 UTC (permalink / raw) > > > This patch introduces asymmetry between create and destroy > > > of the admin queue. Does this alternative patch solve > > > the problem? > > > > > > The patch changes the order of device removal flow from: > > > 1. delete controller > > > 2. destroy queue > > > > > > to: > > > 1. destroy queue > > > 2. delete controller > > > > > > Or more specifically: > > > 1. own the controller deletion (make sure we are not > > > competing with anyone) > > > 2. get rid of inflight reconnects (which also destroy and > > > create queues) > > > 3. destroy the queue > > > 4. safely queue controller deletion > > > > > > Thoughts? > > > > > > > Your patch causes a deadlock during device removal. > > > > The controller delete work thread is stuck in c4iw_destroy_qp waiting on > > all references to go away. Either nvmf/rdma or the rdma-cm or both. > > > > (gdb) list *c4iw_destroy_qp+0x155 > > 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596). > > 1591 c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, > > &attrs, 0); > > 1592 wait_event(qhp->wait, !qhp->ep); > > 1593 > > 1594 remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid); > > 1595 atomic_dec(&qhp->refcnt); > > 1596 wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); > > 1597 > > 1598 spin_lock_irq(&rhp->lock); > > 1599 if (!list_empty(&qhp->db_fc_entry)) > > 1600 list_del_init(&qhp->db_fc_entry); > > > > The rdma-cm work thread is stuck trying to grab the cm_id mutex: > > > > (gdb) list *cma_disable_callback+0x2e > > 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715). > > 710 > > 711 static int cma_disable_callback(struct rdma_id_private *id_priv, > > 712 enum rdma_cm_state state) > > 713 { > > 714 mutex_lock(&id_priv->handler_mutex); > > 715 if (id_priv->state != state) { > > 716 mutex_unlock(&id_priv->handler_mutex); > > 717 return -EINVAL; > > 718 } > > 719 return 0; > > > > And the nvmf cm event handler is stuck waiting for the controller delete > > to finish: > > > > (gdb) list *nvme_rdma_device_unplug+0x97 > > 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358). > > warning: Source file is more recent than executable. > > 1353 queue_delete: > > 1354 /* queue controller deletion */ > > 1355 queue_work(nvme_rdma_wq, &ctrl->delete_work); > > 1356 flush_work(&ctrl->delete_work); > > 1357 return ret; > > 1358 } > > 1359 > > 1360 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, > > 1361 struct rdma_cm_event *ev) > > 1362 { > > > > Seems like the rdma-cm work thread is trying to grab the cm_id lock for > > the cm_id that is handling the DEVICE_REMOVAL event. > > > > And, the nvmf/rdma delete controller work thread is trying to delete the cm_id > that received the DEVICE_REMOVAL event, which is the crux o' the biscuit, > methinks... > Correction: the del controller work thread is trying to destroy the qp associated with the cm_id. But the point is this cm_id/qp should NOT be touched by the del controller thread because the unplug thread should have cleared the Q_CONNECTED bit and thus took ownership of destroy it. I'll add some debug prints to see which path is being taken by nvme_rdma_device_unplug(). ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 15:26 ` Steve Wise @ 2016-07-14 21:27 ` Steve Wise 2016-07-15 15:52 ` Steve Wise 0 siblings, 1 reply; 26+ messages in thread From: Steve Wise @ 2016-07-14 21:27 UTC (permalink / raw) > > > > This patch introduces asymmetry between create and destroy > > > > of the admin queue. Does this alternative patch solve > > > > the problem? > > > > > > > > The patch changes the order of device removal flow from: > > > > 1. delete controller > > > > 2. destroy queue > > > > > > > > to: > > > > 1. destroy queue > > > > 2. delete controller > > > > > > > > Or more specifically: > > > > 1. own the controller deletion (make sure we are not > > > > competing with anyone) > > > > 2. get rid of inflight reconnects (which also destroy and > > > > create queues) > > > > 3. destroy the queue > > > > 4. safely queue controller deletion > > > > > > > > Thoughts? > > > > > > > > > > Your patch causes a deadlock during device removal. > > > > > > The controller delete work thread is stuck in c4iw_destroy_qp waiting on > > > all references to go away. Either nvmf/rdma or the rdma-cm or both. > > > > > > (gdb) list *c4iw_destroy_qp+0x155 > > > 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596). > > > 1591 c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, > > > &attrs, 0); > > > 1592 wait_event(qhp->wait, !qhp->ep); > > > 1593 > > > 1594 remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid); > > > 1595 atomic_dec(&qhp->refcnt); > > > 1596 wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); > > > 1597 > > > 1598 spin_lock_irq(&rhp->lock); > > > 1599 if (!list_empty(&qhp->db_fc_entry)) > > > 1600 list_del_init(&qhp->db_fc_entry); > > > > > > The rdma-cm work thread is stuck trying to grab the cm_id mutex: > > > > > > (gdb) list *cma_disable_callback+0x2e > > > 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715). > > > 710 > > > 711 static int cma_disable_callback(struct rdma_id_private *id_priv, > > > 712 enum rdma_cm_state state) > > > 713 { > > > 714 mutex_lock(&id_priv->handler_mutex); > > > 715 if (id_priv->state != state) { > > > 716 mutex_unlock(&id_priv->handler_mutex); > > > 717 return -EINVAL; > > > 718 } > > > 719 return 0; > > > > > > And the nvmf cm event handler is stuck waiting for the controller delete > > > to finish: > > > > > > (gdb) list *nvme_rdma_device_unplug+0x97 > > > 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358). > > > warning: Source file is more recent than executable. > > > 1353 queue_delete: > > > 1354 /* queue controller deletion */ > > > 1355 queue_work(nvme_rdma_wq, &ctrl->delete_work); > > > 1356 flush_work(&ctrl->delete_work); > > > 1357 return ret; > > > 1358 } > > > 1359 > > > 1360 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, > > > 1361 struct rdma_cm_event *ev) > > > 1362 { > > > > > > Seems like the rdma-cm work thread is trying to grab the cm_id lock for > > > the cm_id that is handling the DEVICE_REMOVAL event. > > > > > > > And, the nvmf/rdma delete controller work thread is trying to delete the cm_id > > that received the DEVICE_REMOVAL event, which is the crux o' the biscuit, > > methinks... > > > > Correction: the del controller work thread is trying to destroy the qp > associated with the cm_id. But the point is this cm_id/qp should NOT be touched > by the del controller thread because the unplug thread should have cleared the > Q_CONNECTED bit and thus took ownership of destroy it. 
I'll add some debug > prints to see which path is being taken by nvme_rdma_device_unplug(). > After further debug, the del controller work thread is not trying to destroy the qp/cm_id that received the event. That qp/cm_id is successfully deleted by the unplug thread. However the first cm_id/qp that is destroyed by the del controller work thread gets stuck in c4iw_destroy_qp() due to the deadlock. So I need to understand more about the deadlock... Steve. ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl 2016-07-14 21:27 ` Steve Wise @ 2016-07-15 15:52 ` Steve Wise 2016-07-17 6:01 ` Sagi Grimberg 0 siblings, 1 reply; 26+ messages in thread From: Steve Wise @ 2016-07-15 15:52 UTC (permalink / raw) > > Correction: the del controller work thread is trying to destroy the qp > > associated with the cm_id. But the point is this cm_id/qp should NOT be > touched > > by the del controller thread because the unplug thread should have cleared the > > Q_CONNECTED bit and thus took ownership of destroy it. I'll add some debug > > prints to see which path is being taken by nvme_rdma_device_unplug(). > > > > After further debug, the del controller work thread is not trying to destroy the > qp/cm_id that received the event. That qp/cm_id is successfully deleted by the > unplug thread. However the first cm_id/qp that is destroyed by the del > controller work thread gets stuck in c4iw_destroy_qp() due to the deadlock. So > I need to understand more about the deadlock... Hey Sagi, here is some lite reading for you. :) Prelude: As part of disconnecting an iwarp connection, the iwarp provider needs to post an IW_CM_EVENT_CLOSE event to iw_cm, which is scheduled onto the singlethread workq thread for iw_cm. Here is what happens with Sagi's patch: nvme_rdma_device_unplug() calls nvme_rdma_stop_queue() which calls rdma_disconnect(). This triggers the disconnect. iw_cxgb4 posts the IW_CM_EVENT_CLOSE to iw_cm, which ends up calling cm_close_handler() in the iw_cm workq thread context. cm_close_handler() calls the rdma_cm event handler for this cm_id, function cm_iw_handler(), which blocks until any currently running event handler for this cm_id finishes. It does this by calling cm_disable_callback(). However since this whole unplug process is running in the event handler function for this same cm_id, the iw_cm workq thread is now stuck in a deadlock. nvme_rdma_device_unplug() however, continues on and schedules the controller delete worker thread and waits for it to complete. The delete controller worker thread tries to disconnect and destroy all the remaining IO queues, but gets stuck in the destroy() path on the first IO queue because the iw_cm workq thread is already stuck, and processing the CLOSE event is required to release a reference the iw_cm has on the iwarp providers qp. So everything comes to a grinding halt.... Now: Ming's 2 patches avoid this deadlock because the cm_id that received the device removal event is disconnected/destroyed _only after_ all the controller queues are disconnected/destroyed. So nvme_rdma_device_unplug() doesn't get stuck waiting for the controller to delete the io queues, and only after that completes, does it delete the cm_id/qp that got the device removal event. It then returns thus causing the rdma_cm to release the cm_id's callback mutex. This causes the iw_cm workq thread to now unblock and we continue on. (can you say house of cards?) So the net is: the cm_id that received the device remove event _must_ be disconnect/destroyed _last_. Steve. ^ permalink raw reply [flat|nested] 26+ messages in thread
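The cycle Steve describes has a familiar shape: an event handler that holds a per-connection lock (the cm_id handler mutex) and then synchronously waits, via flush_work(), for work whose progress ultimately requires that same lock. A small pthread sketch of the shape follows; it uses trylock so it prints the problem instead of hanging, whereas in the kernel the wait simply never completes. Thread and lock names are illustrative.
--
/*
 * Userspace sketch of the dependency cycle described above: an event
 * handler holds a per-connection lock and synchronously waits for
 * other work, while that work (indirectly, via the CLOSE event)
 * needs the very same lock.  Names are illustrative.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t handler_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the iw_cm work that must deliver the CLOSE event. */
static void *close_event_work(void *arg)
{
	(void)arg;
	if (pthread_mutex_trylock(&handler_mutex) != 0) {
		printf("close work: handler mutex is held by the removal "
		       "handler -> in the kernel this blocks forever\n");
		return NULL;
	}
	pthread_mutex_unlock(&handler_mutex);
	return NULL;
}

/* Stands in for nvme_rdma_device_unplug() running in the cm_id handler. */
static void device_removal_handler(void)
{
	pthread_t worker;

	pthread_mutex_lock(&handler_mutex);	/* taken by the rdma_cm core */

	/*
	 * Like flush_work(&ctrl->delete_work): we wait for progress that
	 * ultimately depends on the CLOSE event work, which needs
	 * handler_mutex -> cycle.
	 */
	pthread_create(&worker, NULL, close_event_work, NULL);
	pthread_join(&worker, NULL);

	pthread_mutex_unlock(&handler_mutex);
}

int main(void)
{
	device_removal_handler();
	return 0;
}
--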
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-15 15:52               ` Steve Wise
@ 2016-07-17  6:01                 ` Sagi Grimberg
  2016-07-18 14:55                   ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2016-07-17 6:01 UTC (permalink / raw)

> Hey Sagi, here is some lite reading for you. :)
>
> [... full description of the iw_cm DEVICE_REMOVAL deadlock snipped ...]
>
> So the net is: the cm_id that received the device remove event _must_ be
> disconnect/destroyed _last_.

Hey Steve, thanks for the detailed description.

IMHO, this really looks like a buggy design in iWARP connection
management implementation.  The fact that a rdma_disconnect forward
progress is dependent on other cm_id execution looks wrong and
backwards to me.

Given that the device is being removed altogether I don't think that
calling rdma_disconnect is important at all.  Given the fact that
ib_drain_qp makes sure that the queue-pair is in error state does
this incremental patch resolve the iwarp-deadlock you are seeing?

--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index fd90b5c00aae..e7eec7d8c705 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1342,8 +1342,12 @@ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue)
 		goto queue_delete;
 	}
 
+	/*
+	 * iwcm does not handle rdma_disconnect within DEVICE_REMOVAL
+	 * event very well, so we settle with qp drain
+	 */
+	ib_drain_qp(queue->qp);
 	/* Free this queue ourselves */
-	nvme_rdma_stop_queue(queue);
 	nvme_rdma_destroy_queue_ib(queue);
 
 	/* Return non-zero so the cm_id will destroy implicitly */
--

^ permalink raw reply related	[flat|nested] 26+ messages in thread
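For context on why the drain is expected to be enough: ib_core's generic
drain fallback moves the QP to the error state itself before waiting for the
flush.  The sketch below is a simplified approximation of that fallback (the
internal completion type is approximated, error checking is dropped, and the
fallback only runs when the provider does not supply its own
drain_sq/drain_rq hooks; iw_cxgb4 supplied its own at the time, which is
where the assumption breaks down, as discussed next).

	struct drain_cqe {		/* stand-in for ib_core's internal type */
		struct ib_cqe cqe;
		struct completion done;
	};

	static void drain_done(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct drain_cqe *d = container_of(wc->wr_cqe,
						   struct drain_cqe, cqe);

		complete(&d->done);
	}

	static void sketch_drain_sq(struct ib_qp *qp)
	{
		struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
		struct ib_send_wr swr = {}, *bad_swr;
		struct drain_cqe sdrain;

		/* Move the QP to ERROR so outstanding WRs complete in flush. */
		ib_modify_qp(qp, &attr, IB_QP_STATE);

		/* Post a marker WR and wait for its flush completion. */
		sdrain.cqe.done = drain_done;
		init_completion(&sdrain.done);
		swr.wr_cqe = &sdrain.cqe;
		ib_post_send(qp, &swr, &bad_swr);
		wait_for_completion(&sdrain.done);
	}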
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-17  6:01                 ` Sagi Grimberg
@ 2016-07-18 14:55                   ` Steve Wise
  2016-07-18 15:47                     ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 14:55 UTC (permalink / raw)

> [... full description of the iw_cm deadlock snipped ...]

> Hey Steve, thanks for the detailed description.
>
> IMHO, this really looks like a buggy design in iWARP connection
> management implementation.  The fact that a rdma_disconnect forward
> progress is dependent on other cm_id execution looks wrong and
> backwards to me.

Yea.  I'm not sure how to fix it at this point...

> Given that the device is being removed altogether I don't think that
> calling rdma_disconnect is important at all.  Given the fact that
> ib_drain_qp makes sure that the queue-pair is in error state does
> this incremental patch resolve the iwarp-deadlock you are seeing?

I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
When I implemented them I assumed the QP would be disconnected.  Let me try
this and report back.

> [... Sagi's incremental ib_drain_qp patch snipped ...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 14:55                   ` Steve Wise
@ 2016-07-18 15:47                     ` Steve Wise
  2016-07-18 16:34                       ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 15:47 UTC (permalink / raw)

> [... earlier discussion snipped ...]

> I'll need to change c4iw_drain_sq/rq() to move the QP to ERROR explicitly.
> When I implemented them I assumed the QP would be disconnected.  Let me try
> this and report back.

Just doing the drain has the same issue, because moving the QP to ERROR goes
through the same provider/IWCM interactions as a disconnect.

The reason this close provider/iwcm interaction is needed is the iWARP
relationship between QPs and the TCP connection (and thus the cm_id).  The
iwarp driver stack manipulates the QP state based on the connection events
directly.  For example, the transition from IDLE -> RTS is done _only_ by
the iwarp driver stack.  Similarly, if the peer closes the TCP connection
(or aborts), then the local iwarp driver stack will move the QP out of RTS
into CLOSING or ERROR.  I think this is different from IB and the IBCM.
This, combined with the IWCM maintaining a reference on the QP until the
iwarp driver posts a CLOSED event for its iw_cm_id, is what causes the
problem here.  I'm not justifying the existing logic, just trying to
explain.

Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm
to fix this problem...  I guess the main problem is that qp disconnect and
destroy basically cannot be done in the event handler.  Granted, it works if
you only destroy the qp associated with the cm_id handling the event, but
that still causes a deadlock; it's just that the event handler can return
and thus unblock the deadlock.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 15:47                     ` Steve Wise
@ 2016-07-18 16:34                       ` Steve Wise
  2016-07-18 18:04                         ` Steve Wise
  0 siblings, 1 reply; 26+ messages in thread
From: Steve Wise @ 2016-07-18 16:34 UTC (permalink / raw)

> [... earlier discussion snipped ...]

> Perhaps we can explore changing the iw_cm and the iWARP parts of the rdma_cm
> to fix this problem...  I guess the main problem is that qp disconnect and
> destroy basically cannot be done in the event handler.  Granted, it works if
> you only destroy the qp associated with the cm_id handling the event, but
> that still causes a deadlock; it's just that the event handler can return
> and thus unblock the deadlock.

I think I can fix this in iw_cxgb4.  Stay tuned.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
  2016-07-18 16:34                       ` Steve Wise
@ 2016-07-18 18:04                         ` Steve Wise
  0 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-18 18:04 UTC (permalink / raw)

> [... earlier discussion snipped ...]

> I think I can fix this in iw_cxgb4.  Stay tuned.

Nope.  Without refactoring the IWCM (somehow), I don't know how to fix this.
I think we should go ahead with Ming's patches, perhaps add a FIXME comment,
and I can look into fixing the IWCM.  Unless you have another approach that
only disconnects/flushes/destroys the cm_id/qp that received the device
removal event at the end of the handler upcall...

^ permalink raw reply	[flat|nested] 26+ messages in thread
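A FIXME of the kind suggested above would likely sit in the DEVICE_REMOVAL
arm of the cm event handler.  The fragment below is only a sketch of that
idea: the handler shape is simplified and the comment wording is
illustrative, not taken from any posted patch.

	static int sketch_cm_handler(struct rdma_cm_id *cm_id,
				     struct rdma_cm_event *ev)
	{
		struct nvme_rdma_queue *queue = cm_id->context;

		switch (ev->event) {
		case RDMA_CM_EVENT_DEVICE_REMOVAL:
			/*
			 * FIXME: the cm_id that received DEVICE_REMOVAL must
			 * be the last one disconnected/destroyed.  Tearing
			 * down other cm_ids/QPs from inside this handler can
			 * deadlock the single-threaded iw_cm workqueue (seen
			 * with iw_cxgb4).  Revisit if the IWCM stops
			 * serializing CLOSE handling behind the running
			 * event handler.
			 */
			/* Returning non-zero destroys the cm_id implicitly. */
			return nvme_rdma_device_unplug(queue);
		default:
			return 0;
		}
	}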
* [PATCH 0/2] nvme-rdma: device removal crash fixes
  2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
  2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
  2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
@ 2016-07-13 21:58 ` Steve Wise
  2 siblings, 0 replies; 26+ messages in thread
From: Steve Wise @ 2016-07-13 21:58 UTC (permalink / raw)

> From: Ming Lin <ming.l at samsung.com>
>
> Here are 2 fixes for the device removal crash.
>
> Steve,
>
> Could you review and add tag if you feel OK?
> Thanks again for your great debug.

Thanks for the quick fixes.  The series tests good.  I verified no crashing
on module unload/shutdown using both cxgb4 and mlx4.

Tested-by: Steve Wise <swise at opengridcomputing.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread
Thread overview: 26+ messages
2016-07-13 21:26 [PATCH 0/2] nvme-rdma: device removal crash fixes Ming Lin
2016-07-13 21:26 ` [PATCH 1/2] nvme-rdma: grab reference for device removal event Ming Lin
2016-07-13 21:33 ` Steve Wise
2016-07-13 21:26 ` [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl Ming Lin
2016-07-13 21:33 ` Steve Wise
2016-07-13 23:19 ` J Freyensee
2016-07-13 23:36 ` Ming Lin
2016-07-13 23:59 ` J Freyensee
2016-07-14 6:39 ` Ming Lin
2016-07-14 17:09 ` J Freyensee
2016-07-14 18:04 ` Ming Lin
2016-07-14 9:15 ` Sagi Grimberg
2016-07-14 9:17 ` Sagi Grimberg
2016-07-14 14:30 ` Steve Wise
2016-07-14 14:44 ` Sagi Grimberg
2016-07-14 14:59 ` Steve Wise
[not found] ` <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
2016-07-14 15:02 ` Steve Wise
2016-07-14 15:26 ` Steve Wise
2016-07-14 21:27 ` Steve Wise
2016-07-15 15:52 ` Steve Wise
2016-07-17 6:01 ` Sagi Grimberg
2016-07-18 14:55 ` Steve Wise
2016-07-18 15:47 ` Steve Wise
2016-07-18 16:34 ` Steve Wise
2016-07-18 18:04 ` Steve Wise
2016-07-13 21:58 ` [PATCH 0/2] nvme-rdma: device removal crash fixes Steve Wise