linux-nvme.lists.infradead.org archive mirror
From: swise@opengridcomputing.com (Steve Wise)
Subject: nvme/rdma initiator stuck on reboot
Date: Fri, 19 Aug 2016 09:22:00 -0500	[thread overview]
Message-ID: <007f01d1fa25$0c8ce7f0$25a6b7d0$@opengridcomputing.com> (raw)
In-Reply-To: <f06590ad-e408-e234-2ec4-7f28978411b7@grimberg.me>

> 
> > Btw, in that case the patch is not actually correct, as even workqueue
> > with a higher concurrency level MAY deadlock under enough memory
> > pressure.  We'll need separate workqueues to handle this case I think.
> 
> Steve, does it help if you run the delete on the system_long_wq [1]?
> Note, I've seen problems with forward progress when sharing
> a workqueue between teardown/reconnect sequences and the rest of
> the system (mostly in srp).
> 

I can try this, but see my comments below.  I'm not sure there is any deadlock
at this point...

> >> Yes?  And the reconnect worker was never completing?  Why is that?  Here
> >> are a few tidbits about iWARP connections: address resolution == neighbor
> >> discovery.  So if the neighbor is unreachable, it will take a few seconds
> >> for the OS to give up and fail the resolution.  If the neigh entry is
> >> valid and the peer becomes unreachable during connection setup, it might
> >> take 60 seconds or so for a connect operation to give up and fail.  So
> >> this is probably slowing the reconnect thread down.  But shouldn't the
> >> reconnect thread notice that a delete is trying to happen and bail out?
> >
> > I think we should aim for a state machine that can detect this, but
> > we'll have to see if that will end up in synchronization overkill.
> 
> The reconnect logic does take care of this state transition...

Yes, I agree.  The disconnect/delete command changes the controller state from
RECONNECTING to DELETING, and the reconnect thread will not reschedule itself
for that controller.

In further debugging (see my subsequent emails), it appears that there really
isn't a deadlock.  First, let me describe the main problem: the IWCM will block
destroying a cm_id until the driver has completed a connection setup attempt.
See IWCM_F_CONNECT_WAIT in drivers/infiniband/core/iwcm.c.  Further, iw_cxgb4's
TCP engine can take up to 60 seconds to fail a TCP connection setup if the neigh
entry is valid yet the peer is unresponsive.  So what we see happening is that
when the keep-alive timeout (kato) kicks in after the target reboots, and
_before_ the neigh entry for the target is flushed due to loss of connectivity,
connection setup attempts are all initiated by the reconnect logic in the
nvmf/rdma host driver.  Even though the nvme/rdma host driver times out such an
attempt in ~1 second (NVME_RDMA_CONNECT_TIMEOUT_MS), it can then get stuck for
up to 60 seconds destroying the cm_id.  In my setup each controller has 32 io
queues, so the reconnect thread can stall for up to 32 minutes per controller.
Even if the controllers are deleted, changing the controller state to DELETING,
the thread will still get stuck for at least 60 seconds destroying its current
connecting cm_id.  Multiply that by the 10 controllers in my test and the
reconnect logic takes far too long to give up.

So I think I need to see about removing the IWCM_F_CONNECT_WAIT logic in the
iwcm.

One other thing: in both nvme_rdma_device_unplug() and nvme_rdma_del_ctrl(),
the code kicks the delete_work thread to delete the controller and then calls
flush_work().  This is a possible use-after-free, no?  The proper sequence, I
think, is to take a reference on the ctrl, kick the delete_work thread, call
flush_work(), and then nvme_put_ctrl(ctrl).  Do you agree?  While doing this
debugging, I wondered whether this issue was causing a delete thread to get
stuck in flush_work().  I never proved that, and I think the real issue is the
IWCM_F_CONNECT_WAIT logic.

Steve.

Thread overview: 15+ messages
2016-08-16 19:40 nvme/rdma initiator stuck on reboot Steve Wise
2016-08-17 10:23 ` Sagi Grimberg
2016-08-17 14:33   ` Steve Wise
2016-08-17 14:46     ` Sagi Grimberg
2016-08-17 15:13       ` Steve Wise
2016-08-18  7:01         ` Sagi Grimberg
2016-08-18 13:59           ` Steve Wise
2016-08-18 14:47             ` Steve Wise
2016-08-18 15:21             ` 'Christoph Hellwig'
2016-08-18 17:59               ` Steve Wise
2016-08-18 18:50                 ` Steve Wise
2016-08-18 19:11                   ` Steve Wise
2016-08-19  8:58               ` Sagi Grimberg
2016-08-19 14:22                 ` Steve Wise [this message]
     [not found]                 ` <008001d1fa25$0c960fb0$25c22f10$@opengridcomputing.com>
2016-08-19 14:24                   ` Steve Wise
