From: swise@opengridcomputing.com (Steve Wise)
Subject: [PATCH 2/2] nvme-rdma: move admin queue cleanup to nvme_rdma_free_ctrl
Date: Thu, 14 Jul 2016 10:02:53 -0500
Message-ID: <011c01d1dde0$cc2f74d0$648e5e70$@opengridcomputing.com>
In-Reply-To: <011301d1dde0$4450e4e0$ccf2aea0$@opengridcomputing.com>
> > This patch introduces asymmetry between create and destroy
> > of the admin queue. Does this alternative patch solve
> > the problem?
> >
> > The patch changes the order of device removal flow from:
> > 1. delete controller
> > 2. destroy queue
> >
> > to:
> > 1. destroy queue
> > 2. delete controller
> >
> > Or more specifically:
> > 1. own the controller deletion (make sure we are not
> > competing with anyone)
> > 2. get rid of inflight reconnects (which also destroy and
> > create queues)
> > 3. destroy the queue
> > 4. safely queue controller deletion
> >
> > Thoughts?
> >
>
> Your patch causes a deadlock during device removal.
>
> The controller delete work thread is stuck in c4iw_destroy_qp waiting on
> all references to go away. Either nvmf/rdma or the rdma-cm or both.
>
> (gdb) list *c4iw_destroy_qp+0x155
> 0x15af5 is in c4iw_destroy_qp (drivers/infiniband/hw/cxgb4/qp.c:1596).
> 1591 c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE,
> &attrs, 0);
> 1592 wait_event(qhp->wait, !qhp->ep);
> 1593
> 1594 remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
> 1595 atomic_dec(&qhp->refcnt);
> 1596 wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
> 1597
> 1598 spin_lock_irq(&rhp->lock);
> 1599 if (!list_empty(&qhp->db_fc_entry))
> 1600 list_del_init(&qhp->db_fc_entry);
>
> The rdma-cm work thread is stuck trying to grab the cm_id mutex:
>
> (gdb) list *cma_disable_callback+0x2e
> 0x29e is in cma_disable_callback (drivers/infiniband/core/cma.c:715).
> 710
> 711 static int cma_disable_callback(struct rdma_id_private *id_priv,
> 712 enum rdma_cm_state state)
> 713 {
> 714 mutex_lock(&id_priv->handler_mutex);
> 715 if (id_priv->state != state) {
> 716 mutex_unlock(&id_priv->handler_mutex);
> 717 return -EINVAL;
> 718 }
> 719 return 0;
>
> And the nvmf cm event handler is stuck waiting for the controller delete
> to finish:
>
> (gdb) list *nvme_rdma_device_unplug+0x97
> 0x1027 is in nvme_rdma_device_unplug (drivers/nvme/host/rdma.c:1358).
> warning: Source file is more recent than executable.
> 1353 queue_delete:
> 1354 /* queue controller deletion */
> 1355 queue_work(nvme_rdma_wq, &ctrl->delete_work);
> 1356 flush_work(&ctrl->delete_work);
> 1357 return ret;
> 1358 }
> 1359
> 1360 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> 1361 struct rdma_cm_event *ev)
> 1362 {
>
> Seems like the rdma-cm work thread is trying to grab the cm_id lock for
> the cm_id that is handling the DEVICE_REMOVAL event.
>
And, the nvmf/rdma delete controller work thread is trying to delete the cm_id
that received the DEVICE_REMOVAL event, which is the crux o' the biscuit,
methinks...