linux-nvme.lists.infradead.org archive mirror
* host/target keep alive timeout loop
From: Sagi Grimberg @ 2016-11-08 10:23 UTC


> Hey Sagi/Christoph,
>
> While running the same kato/recovery tests I've logged in a few other threads,
> occasionally I get some controllers on the host that will not reconnect, even
> after I quiesce the test and have the interfaces up and everything is pingable.
> When it gets in this state, some of the 10 controllers are up and ok, and others
> are stuck in this reconnect/fail loop.
>
> The host is stuck continually logging this for one or more controllers:
>
> [ 7885.617176] nvme nvme10: failed nvme_keep_alive_end_io error=16385
> [ 7886.837087] nvme nvme10: rdma_resolve_addr wait failed (-110).
> [ 7890.183979] nvme nvme10: failed to initialize i/o queue: -110
> [ 7890.247538] nvme nvme10: Failed reconnect attempt, requeueing...

This looks like an underlying problem causing the host rdma_connect
to time out. Did it happen before or is it a new thing?
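
If you want to rule out the nvme-rdma layer, something along these lines
exercises the same rdma_cm address-resolution step from userspace with
librdmacm (untested sketch; the target IP/port below are placeholders --
substitute your NVMf target -- and build with -lrdmacm). Getting
RDMA_CM_EVENT_ADDR_ERROR with status -110 (ETIMEDOUT) here as well would
point at the fabric/rdma_cm side rather than the driver:

/* Untested sketch: probe rdma_cm address resolution from userspace.
 * The target address below is a placeholder.
 */
#include <stdio.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	struct rdma_cm_event *ev;
	struct sockaddr_in dst = { 0 };

	dst.sin_family = AF_INET;
	dst.sin_port = htons(4420);			/* NVMf default port */
	inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);	/* placeholder target */

	ch = rdma_create_event_channel();
	if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
		return 1;

	/* same step the host log shows timing out; 2s timeout here */
	if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000)) {
		perror("rdma_resolve_addr");
		return 1;
	}

	if (rdma_get_cm_event(ch, &ev))
		return 1;
	if (ev->event == RDMA_CM_EVENT_ADDR_RESOLVED)
		printf("address resolved ok\n");
	else
		printf("got %s, status %d\n",
		       rdma_event_str(ev->event), ev->status);
	rdma_ack_cm_event(ev);

	rdma_destroy_id(id);
	rdma_destroy_event_channel(ch);
	return 0;
}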


* host/target keep alive timeout loop
From: Steve Wise @ 2016-11-08 14:56 UTC


 
> > Hey Sagi/Christoph,
> >
> > While running the same kato/recovery tests I've logged in a few other
> > threads, occasionally I get some controllers on the host that will not
> > reconnect, even after I quiesce the test and have the interfaces up and
> > everything is pingable. When it gets in this state, some of the 10
> > controllers are up and ok, and others are stuck in this reconnect/fail
> > loop.
> >
> > The host is stuck continually logging this for one or more controllers:
> >
> > [ 7885.617176] nvme nvme10: failed nvme_keep_alive_end_io error=16385
> > [ 7886.837087] nvme nvme10: rdma_resolve_addr wait failed (-110).
> > [ 7890.183979] nvme nvme10: failed to initialize i/o queue: -110
> > [ 7890.247538] nvme nvme10: Failed reconnect attempt, requeueing...
> 
> This looks like an underlying problem causing the host rdma_connect
> to time out. Did it happen before or is it a new thing?

I can't say for sure, since until Christoph's recent fix it was crashing in
other ways. I don't think the connection is failing at the driver level, but
I'll look into the cxgb4 stats when it is in this state to see what is
happening at that level.
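
For reference, "ethtool -S <iface>" on the cxgb4 port dumps the driver's
per-device counters; a rough, untested C equivalent via the SIOCETHTOOL
ioctl (the default interface name below is only a placeholder) would be:

/* Untested sketch: dump NIC driver statistics via SIOCETHTOOL,
 * i.e. roughly what "ethtool -S <iface>" prints. Error handling trimmed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	const char *ifname = argc > 1 ? argv[1] : "eth2";	/* placeholder */
	struct ifreq ifr = { 0 };
	struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
	struct ethtool_gstrings *strings;
	struct ethtool_stats *stats;
	unsigned int i, n;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	ifr.ifr_data = (void *)&drvinfo;	/* how many stats does the driver export? */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GDRVINFO");
		return 1;
	}
	n = drvinfo.n_stats;

	strings = calloc(1, sizeof(*strings) + n * ETH_GSTRING_LEN);
	strings->cmd = ETHTOOL_GSTRINGS;	/* stat names */
	strings->string_set = ETH_SS_STATS;
	strings->len = n;
	ifr.ifr_data = (void *)strings;
	ioctl(fd, SIOCETHTOOL, &ifr);

	stats = calloc(1, sizeof(*stats) + n * sizeof(__u64));
	stats->cmd = ETHTOOL_GSTATS;		/* stat values */
	stats->n_stats = n;
	ifr.ifr_data = (void *)stats;
	ioctl(fd, SIOCETHTOOL, &ifr);

	for (i = 0; i < n; i++)
		printf("%.*s: %llu\n", ETH_GSTRING_LEN,
		       (char *)strings->data + i * ETH_GSTRING_LEN,
		       (unsigned long long)stats->data[i]);

	close(fd);
	return 0;
}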

