From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Tue, 28 Jun 2016 11:31:27 -0500
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <20160628155159.GA3084@lst.de>
References: <576306EE.4020306@grimberg.me>
 <01b901d1c80b$72f83680$58e8a380$@opengridcomputing.com>
 <01c101d1c80d$96d13c80$c473b580$@opengridcomputing.com>
 <20160616203437.GA19079@lst.de>
 <01e701d1c810$91d851c0$b588f540$@opengridcomputing.com>
 <020201d1c812$ec94b430$c5be1c90$@opengridcomputing.com>
 <1467066582.7205.7.camel@ssi>
 <20160628091433.GA14149@lst.de>
 <005001d1d147$81cd8cb0$8568a610$@opengridcomputing.com>
 <20160628155159.GA3084@lst.de>
Message-ID: <01dc01d1d15a$84f42670$8edc7350$@opengridcomputing.com>

> On Tue, Jun 28, 2016 at 09:15:22AM -0500, Steve Wise wrote:
> > I'm not so sure. I don't see where nvmet leaves unsignaled wrs on the SQ.
> > It either posts chains via RDMA-RW and the last in the chain is always
> > signaled (I think), or it posts signaled IO responses.
>
> Indeed. So we need to figure out where we don't release a rsp.
>

Hey Ming,

For what it's worth, the change you proposed in this thread isn't working for
me. I see maybe one or two successful recoveries, then the target gets stuck.
Several workq threads are stuck destroying various qps, and one thread is
stuck draining a qp. If this change is not the proper fix, then I'm not going
to debug this further.