From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 16 Jun 2016 11:41:59 -0500
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <012a01d1c7e3$31c08820$95419860$@opengridcomputing.com>
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
 <20160616145724.GA32635@infradead.org>
 <012a01d1c7e3$31c08820$95419860$@opengridcomputing.com>
Message-ID: <017b01d1c7ee$008d7730$01a86590$@opengridcomputing.com>

> >
> > On Thu, Jun 16, 2016 at 09:53:45AM -0500, Steve Wise wrote:
> > > [11436.603807] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> > > [11436.609866] BUG: unable to handle kernel NULL pointer dereference at
> > > 0000000000000050
> > > [11436.617764] IP: [] nvmet_rdma_delete_ctrl+0x6f/0x100
> >
> > Can you check using gdb where in the code this is?
> >
> > This is the obvious crash we'll need to fix first. Then we'll need to
> > figure out why the keep alive timer times out under this workload.
> >
>
> While Yoichi is gathering this on his setup, I'm trying to reproduce it on mine.
> I hit a similar crash by loading up a fio job, and then bringing down the
> interface of the port used on the host node, let the target timer expire, then
> bring the host interface back up. The target freed the queues, and eventually
> the host reconnected, and the test continued. But shortly after that I hit this
> on the target. It looks related:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> IP: [] nvmet_rdma_queue_disconnect+0x49/0x90 [nvmet_rdma]
> PGD 102f0d1067 PUD 102ccc5067 PMD 0
> Oops: 0002 [#1] SMP

Your patch you sent out seems to resolve my crash. We'll see if Yoichi has the
same results.

Steve.