From: swise@opengridcomputing.com (Steve Wise)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Thu, 16 Jun 2016 15:12:46 -0500
Message-ID: <01b901d1c80b$72f83680$58e8a380$@opengridcomputing.com>
In-Reply-To: <576306EE.4020306@grimberg.me>



> >>
> >> Umm, I think this might be happening because we get to delete_ctrl when
> >> one of our queues has a NULL ctrl. This means that either:
> >> 1. we never got a chance to initialize it, or
> >> 2. we already freed it.
> >>
> >> (1) doesn't seem possible as we have a very short window (that we're
> >> better off eliminating) between when we start the keep-alive timer (in
> >> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
> >>
> >> (2) doesn't seem likely either to me at least as from what I followed,
> >> delete_ctrl should be mutually exclusive with other deletions, moreover,
> >> I didn't see an indication in the logs that any other deletions are
> >> happening.
> >>
> >> Steve, is this something that started happening recently? does the
> >> 4.6-rc3 tag suffer from the same phenomenon?
> >
> > I'll try and reproduce this on the older code, but the keep-alive timer
> > fired for some other reason,
> 
> My assumption was that it fired because it didn't get a keep-alive from
> the host which is exactly what it's supposed to do?
>

Yes, in the original email that started this thread, I showed that two CPUs were
stuck on the host. I surmise that the host was stuck NVMF-wise, so the target's
keep-alive timer fired and crashed the target.
 
> > so I'm not sure the target side keep-alive has been
> > tested until now.
> 
> I tested it, and IIRC the original patch had Ming's tested-by tag.
> 

How did you test it?

> > But it is easy to test over iWARP, just do this while a heavy
> > fio is running:
> >
> > ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up
> 
> So this is related to I/O load then? Does it happen when
> you just do it without any I/O? (or small load)?

I'll try this.  

Note that there are two crashes discussed in this thread. The first is the one
Yoichi saw on his nodes, where the host hung, causing the target keep-alive to
fire and crash the target. That is the crash whose stack traces I included in the
original email starting this thread. The second is a repeatable crash on my
setup, which looks the same and happens when I bring the interface down long
enough to trigger the keep-alive. Since I can reproduce the latter easily, I'm
continuing to debug that one.

Here is the fio command I use:

fio --bs=1k --time_based --runtime=2000 --numjobs=8 --name=TEST-1k-8g-20-8-32 \
    --direct=1 --iodepth=32 -rw=randread --randrepeat=0 --norandommap --loops=1 \
    --exitall --ioengine=libaio --filename=/dev/nvme1n1
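
For anyone who wants to try the same sequence, here is a rough sketch that
combines the fio run with the interface bounce quoted above. It is not a tested
script: the function names, the RUN dry-run switch, and the 30 second warm-up
are assumptions made for illustration, and the interface, address, and device
arguments are placeholders you must supply. By default it only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch only. RUN defaults to "echo", so nothing is executed, just printed.
# Set RUN to the empty string (and run as root) to execute the commands.
RUN=${RUN-echo}

# Take the interface down, wait, then bring it back up with its address.
# Mirrors: ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up
bounce_iface() {
    iface=$1; addr=$2; down_secs=$3
    $RUN ifconfig "$iface" down
    $RUN sleep "$down_secs"
    $RUN ifconfig "$iface" "$addr" up
}

# Start the fio load in the background, then bounce the interface under load.
# The fio options are the ones from the command above; only the device
# path is parameterized.
repro() {
    iface=$1; addr=$2; dev=$3
    $RUN fio --bs=1k --time_based --runtime=2000 --numjobs=8 \
        --name=TEST-1k-8g-20-8-32 --direct=1 --iodepth=32 -rw=randread \
        --randrepeat=0 --norandommap --loops=1 --exitall \
        --ioengine=libaio --filename="$dev" &
    fio_pid=$!
    $RUN sleep 30            # assumed warm-up so I/O is flowing before the bounce
    bounce_iface "$iface" "$addr" 15
    wait "$fio_pid"
}
```

After sourcing it, "repro ethX <ipaddr>/<mask> /dev/nvme1n1" prints the
sequence; prefix the call with "RUN=" to run it for real.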
