From mboxrd@z Thu Jan  1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 16 Jun 2016 14:59:21 -0500
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <5763044A.9090206@grimberg.me>
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
 <20160616145724.GA32635@infradead.org>
 <017001d1c7e7$95057270$bf105750$@opengridcomputing.com>
 <5763044A.9090206@grimberg.me>
Message-ID: <01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>

> 
> >> On Thu, Jun 16, 2016 at 09:53:45AM -0500, Steve Wise wrote:
> >>> [11436.603807] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> >>> [11436.609866] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> >>> [11436.617764] IP: [<ffffffffa09c6dff>] nvmet_rdma_delete_ctrl+0x6f/0x100
> >>
> >> Can you check using gdb where in the code this is?
> >
> >
> > nvmet_rdma_delete_ctrl():
> > /root/nvmef/nvme-fabrics/drivers/nvme/target/rdma.c:1302
> >                          &nvmet_rdma_queue_list, queue_list) {
> >                  if (queue->nvme_sq.ctrl->cntlid == ctrl->cntlid)
> >       df6:       48 8b 40 38             mov    0x38(%rax),%rax
> >       dfa:       41 0f b7 4d 50          movzwl 0x50(%r13),%ecx
> >       dff:       66 39 48 50             cmp    %cx,0x50(%rax)   <=========== here
> >       e03:       75 cd                   jne    dd2 <nvmet_rdma_delete_ctrl+0x42>
> 
> Umm, I think this might be happening because we get to delete_ctrl when
> one of our queues has a NULL ctrl. This means that either:
> 1. we never got a chance to initialize it, or
> 2. we already freed it.
> 
> (1) doesn't seem likely, as there is only a very short window (one we're
> better off eliminating) between starting the keep-alive timer (in
> alloc_ctrl) and assigning sq->ctrl (install_queue).
> 
> (2) doesn't seem likely to me either: from what I followed, delete_ctrl
> should be mutually exclusive with other deletions. Moreover, I didn't
> see any indication in the logs that other deletions are happening.
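
The faulting instruction is consistent with the NULL ctrl theory: the
oops address is 0x50, and "cmp %cx,0x50(%rax)" reads cntlid at offset
0x50 from whatever was loaded from queue->nvme_sq.ctrl, so %rax was 0.
As a rough user-space sketch (not a proposed patch), the comparison could
be guarded as below. The sketch_* names, the padding, and the plain
singly linked list are made up for illustration; only the 0x50 offset
comes from the disassembly.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the nvmet structures. Only the
 * 0x50 offset of cntlid is taken from the disassembly; the names, the
 * padding, and the list type are assumptions made for this sketch. */
struct sketch_ctrl {
	char pad[0x50];
	unsigned short cntlid;	/* offset 0x50, matching the oops address */
};

_Static_assert(offsetof(struct sketch_ctrl, cntlid) == 0x50,
	       "cntlid must sit at the offset seen in the oops");

struct sketch_queue {
	struct sketch_ctrl *ctrl;	/* NULL if never installed or already freed */
	struct sketch_queue *next;
};

/* True if the queue belongs to ctrl; a queue without a ctrl never matches. */
static bool sketch_queue_matches(const struct sketch_queue *q,
				 const struct sketch_ctrl *ctrl)
{
	return q->ctrl && q->ctrl->cntlid == ctrl->cntlid;
}

/* Count the queues on the list that belong to ctrl. */
static int sketch_count_queues(const struct sketch_queue *head,
			       const struct sketch_ctrl *ctrl)
{
	int n = 0;

	for (const struct sketch_queue *q = head; q; q = q->next)
		if (sketch_queue_matches(q, ctrl))
			n++;
	return n;
}
```

A guard like this would only hide the symptom. It says nothing about why
a queue with a NULL ctrl is on the list in the first place, which is
still the open question.
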
> 
> Steve, is this something that started happening recently? Does the
> 4.6-rc3 tag suffer from the same phenomenon?

I'll try to reproduce this on the older code. But the keep-alive timer fired
for some other reason, so I'm not sure the target-side keep-alive has been
tested until now. It is easy to test over iWARP, though: just do this while a
heavy fio job is running:

ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up