From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 16 Jun 2016 14:59:21 -0500
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <5763044A.9090206@grimberg.me>
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
 <20160616145724.GA32635@infradead.org>
 <017001d1c7e7$95057270$bf105750$@opengridcomputing.com>
 <5763044A.9090206@grimberg.me>
Message-ID: <01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>

>
> >> On Thu, Jun 16, 2016 at 09:53:45AM -0500, Steve Wise wrote:
> >>> [11436.603807] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> >>> [11436.609866] BUG: unable to handle kernel NULL pointer dereference at
> >>> 0000000000000050
> >>> [11436.617764] IP: [] nvmet_rdma_delete_ctrl+0x6f/0x100
> >>
> >> Can you check using gdb where in the code this is?
> >
> >
> > nvmet_rdma_delete_ctrl():
> > /root/nvmef/nvme-fabrics/drivers/nvme/target/rdma.c:1302
> > &nvmet_rdma_queue_list, queue_list) {
> > if (queue->nvme_sq.ctrl->cntlid == ctrl->cntlid)
> > df6: 48 8b 40 38 mov 0x38(%rax),%rax
> > dfa: 41 0f b7 4d 50 movzwl 0x50(%r13),%ecx
> > dff: 66 39 48 50 cmp %cx,0x50(%rax) <=========== here
> > e03: 75 cd jne dd2
>
> Umm, I think this might be happening because we get to delete_ctrl when
> one of our queues has a NULL ctrl. This means that either:
> 1. we never got a chance to initialize it, or
> 2. we already freed it.
>
> (1) doesn't seem possible as we have a very short window (that we're
> better off eliminating) between when we start the keep-alive timer (in
> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
>
> (2) doesn't seem likely either to me at least as from what I followed,
> delete_ctrl should be mutual exclusive with other deletions, moreover,
> I didn't see an indication in the logs that any other deletions are
> happening.
>
> Steve, is this something that started happening recently? does the
> 4.6-rc3 tag suffer from the same phenomenon?

I'll try and reproduce this on the older code, but the keep-alive timer fired
for some other reason, so I'm not sure the target side keep-alive has been
tested until now. But it is easy to test over iWARP, just do this while a
heavy fio is running:

ifconfig ethX down; sleep 15; ifconfig ethX / up
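
For illustration only: the fault address 0000000000000050 together with the
faulting `cmp %cx,0x50(%rax)` is consistent with queue->nvme_sq.ctrl being
NULL when its cntlid is read. Below is a minimal userspace sketch of that loop
with a NULL guard added. All names (sketch_ctrl, sketch_queue,
sketch_delete_ctrl) are hypothetical stand-ins, not the real nvmet structures,
and a guard like this would only avoid the oops; it would not explain how a
queue without a ctrl ended up on the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the controller; not the real nvmet definition. */
struct sketch_ctrl {
    uint16_t cntlid;
};

/* Hypothetical stand-in for a queue on the global queue list. */
struct sketch_queue {
    struct sketch_ctrl *ctrl;    /* assumed NULL before install or after teardown */
    int disconnected;            /* set when the queue is picked for teardown */
    struct sketch_queue *next;   /* plain singly linked list, unlike the kernel's list_head */
};

/*
 * Walk the queue list and mark every queue that belongs to ctrl.
 * Queues whose ctrl pointer is NULL are skipped instead of dereferenced.
 * Returns the number of queues that matched.
 */
int sketch_delete_ctrl(struct sketch_queue *head, const struct sketch_ctrl *ctrl)
{
    int matched = 0;

    assert(ctrl != NULL);

    for (struct sketch_queue *q = head; q != NULL; q = q->next) {
        if (q->ctrl == NULL)
            continue;            /* the guard: no ctrl, nothing to compare */
        if (q->ctrl->cntlid == ctrl->cntlid) {
            q->disconnected = 1;
            matched++;
        }
    }
    return matched;
}
```

With a three-entry list where the middle queue has ctrl == NULL, the walk
skips that entry and still reaches the matching queue behind it.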