From: swise@opengridcomputing.com (Steve Wise)
Subject: nvme/rdma initiator stuck on reboot
Date: Thu, 18 Aug 2016 09:47:52 -0500 [thread overview]
Message-ID: <017601d1f95f$7f270cd0$7d752670$@opengridcomputing.com> (raw)
In-Reply-To: <012701d1f958$b4953290$1dbf97b0$@opengridcomputing.com>
> >
> > >> Can this be related due to the fact that we use a single-threaded
> > >> workqueue for delete/reset/reconnect? (delete cancel_sync the active
> > >> reconnect work...)
> > >>
> > >> Does this untested patch help?
> > >
> > > That seems to do it!
> >
> > Is this a formal tested-by?
>
> Sure,
While the patch worked for deleting the controllers, it still hangs if I reboot
the host after the target reboots and the host begins kato recovery. Looks like
the reconnect thread just gets stuck doing this:
[ 947.095936] nvme nvme4: Failed reconnect attempt, requeueing...
[ 947.616015] nvme nvme5: rdma_resolve_addr wait failed (-110).
[ 947.623943] nvme nvme5: Failed reconnect attempt, requeueing...
[ 948.128012] nvme nvme6: rdma_resolve_addr wait failed (-110).
[ 948.135956] nvme nvme6: Failed reconnect attempt, requeueing...
[ 948.624052] nvme nvme7: rdma_resolve_addr wait failed (-104).
I'll try to get a crash dump of this state to look at all the threads. But I
think we need the reconnect worker to give up if the controller it is
reconnecting is being deleted or the device is being removed.
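
To make the idea concrete, here is a rough user-space sketch of a reconnect
worker that bails out once a delete has started. It is not the nvme-rdma
driver code: the state enum, struct ctrl, reconnect_once(), run_reconnect() and
the two connect callbacks are all hypothetical names made up for illustration,
and the single-threaded model ignores the locking or atomics a real worker
would need around the state check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical controller states; not the driver's actual definitions. */
enum ctrl_state { CTRL_LIVE, CTRL_RECONNECTING, CTRL_DELETING };

struct ctrl {
	enum ctrl_state state;
	int attempts;		/* reconnect attempts made so far */
};

/* Hypothetical connect step: 0 on success, negative errno on failure. */
typedef int (*connect_fn)(struct ctrl *c);

/* Models address resolution timing out (the -110 seen in the log). */
static int connect_times_out(struct ctrl *c)
{
	(void)c;
	return -ETIMEDOUT;
}

/* Models a delete request arriving while the slow connect is in flight. */
static int connect_fails_during_delete(struct ctrl *c)
{
	c->state = CTRL_DELETING;
	return -ETIMEDOUT;
}

/*
 * One pass of the reconnect worker.
 * Returns true if the work should be requeued, false if it is finished,
 * either because the connect succeeded or because the controller is no
 * longer in the reconnecting state.
 */
static bool reconnect_once(struct ctrl *c, connect_fn try_connect)
{
	if (c->state != CTRL_RECONNECTING)
		return false;	/* delete/removal already started: give up */

	c->attempts++;
	if (try_connect(c) == 0) {
		c->state = CTRL_LIVE;
		return false;
	}

	/* The connect may have taken a long time; re-check before requeueing. */
	if (c->state != CTRL_RECONNECTING)
		return false;

	printf("Failed reconnect attempt, requeueing...\n");
	return true;
}

/* Runs passes until the worker stops requeueing or max_passes is reached. */
static int run_reconnect(struct ctrl *c, connect_fn try_connect, int max_passes)
{
	int passes = 0;

	assert(c != NULL && try_connect != NULL);
	while (passes < max_passes) {
		passes++;
		if (!reconnect_once(c, try_connect))
			break;
	}
	return passes;
}
```

With connect_times_out the loop keeps requeueing until max_passes, which is
roughly the behavior in the log above. With connect_fails_during_delete it
stops after the first pass, so a delete worker waiting on it would not be
blocked behind further retries.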
>
> but let me ask a question: So the bug was that the delete controller
> worker was blocked waiting for the reconnect worker to complete. Yes? And the
> reconnect worker was never completing? Why is that? Here are a few tidbits
> about iWARP connections: address resolution == neighbor discovery. So if the
> neighbor is unreachable, it will take a few seconds for the OS to give up and
> fail the resolution. If the neigh entry is valid and the peer becomes
> unreachable during connection setup, it might take 60 seconds or so for a
> connect operation to give up and fail. So this is probably slowing the
> reconnect thread down. But shouldn't the reconnect thread notice that a delete
> is trying to happen and bail out?
>
>