From: swise@opengridcomputing.com (Steve Wise)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Thu, 16 Jun 2016 14:59:21 -0500
Message-ID: <01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>
In-Reply-To: <5763044A.9090206@grimberg.me>

> 
> >> On Thu, Jun 16, 2016 at 09:53:45AM -0500, Steve Wise wrote:
> >>> [11436.603807] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> >>> [11436.609866] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> >>> [11436.617764] IP: [<ffffffffa09c6dff>] nvmet_rdma_delete_ctrl+0x6f/0x100
> >>
> >> Can you check using gdb where in the code this is?
> >
> >
> > nvmet_rdma_delete_ctrl():
> > /root/nvmef/nvme-fabrics/drivers/nvme/target/rdma.c:1302
> >                          &nvmet_rdma_queue_list, queue_list) {
> >                  if (queue->nvme_sq.ctrl->cntlid == ctrl->cntlid)
> >       df6:       48 8b 40 38             mov    0x38(%rax),%rax
> >       dfa:       41 0f b7 4d 50          movzwl 0x50(%r13),%ecx
> >       dff:       66 39 48 50             cmp    %cx,0x50(%rax)   <=========== here
> >       e03:       75 cd                   jne    dd2 <nvmet_rdma_delete_ctrl+0x42>
> 
> Umm, I think this might be happening because we get to delete_ctrl when
> one of our queues has a NULL ctrl. This means that either:
> 1. we never got a chance to initialize it, or
> 2. we already freed it.
> 
> (1) doesn't seem possible as we have a very short window (that we're
> better off eliminating) between when we start the keep-alive timer (in
> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
> 
> (2) doesn't seem likely either, at least to me: from what I followed,
> delete_ctrl should be mutually exclusive with other deletions. Moreover,
> I didn't see any indication in the logs that other deletions are
> happening.
> 
> Steve, is this something that started happening recently? Does the
> 4.6-rc3 tag suffer from the same phenomenon?

I'll try to reproduce this on the older code, but the keep-alive timer fired
for some other reason, so I'm not sure the target-side keep-alive has been
tested until now.  It is easy to test over iWARP, though: just do this while a
heavy fio is running:

ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up
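
For readers following the disassembly above: the faulting address
0000000000000050 matches the displacement in `cmp %cx,0x50(%rax)`, which is
consistent with queue->nvme_sq.ctrl being NULL when the loop reads ->cntlid.
The userspace sketch below models that list walk with a NULL check added. It
is only an illustration under stated assumptions: the model_* types and
model_delete_ctrl() are hypothetical stand-ins, not the kernel structures, and
the thread does not confirm that such a guard is the right fix. A guard like
this would avoid the oops but would not explain why a queue on the list has no
controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. Only the
 * ctrl pointer and cntlid field mirror the quoted source line; the rest
 * is illustrative. */
struct model_ctrl {
	uint16_t cntlid;
};

struct model_queue {
	struct model_ctrl *ctrl;   /* NULL if never assigned or already cleared */
	struct model_queue *next;  /* singly linked stand-in for queue_list */
	int deleted;               /* set when the walk selects this queue */
};

/*
 * Walk the list and mark every queue that belongs to `target`.
 * Queues with no controller are skipped instead of dereferenced.
 * Returns the number of queues marked.
 */
static int model_delete_ctrl(struct model_queue *head,
			     const struct model_ctrl *target)
{
	int marked = 0;

	assert(target != NULL);
	for (struct model_queue *q = head; q != NULL; q = q->next) {
		if (q->ctrl == NULL)
			continue;  /* the unguarded compare would fault here */
		if (q->ctrl->cntlid == target->cntlid) {
			q->deleted = 1;
			marked++;
		}
	}
	return marked;
}
```

Calling model_delete_ctrl() on a list that contains a queue with ctrl == NULL
returns normally and leaves that queue untouched, whereas the unguarded
comparison in the quoted loop would dereference the NULL pointer.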
