From: swise@opengridcomputing.com (Steve Wise)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Thu, 16 Jun 2016 15:12:46 -0500 [thread overview]
Message-ID: <01b901d1c80b$72f83680$58e8a380$@opengridcomputing.com> (raw)
In-Reply-To: <576306EE.4020306@grimberg.me>
> >>
> >> Umm, I think this might be happening because we get to delete_ctrl when
> >> one of our queues has a NULL ctrl. This means that either:
> >> 1. we never got a chance to initialize it, or
> >> 2. we already freed it.
> >>
> >> (1) doesn't seem possible as we have a very short window (that we're
> >> better off eliminating) between when we start the keep-alive timer (in
> >> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
> >>
> >> (2) doesn't seem likely either to me at least as from what I followed,
> >> delete_ctrl should be mutual exclusive with other deletions, moreover,
> >> I didn't see an indication in the logs that any other deletions are
> >> happening.
> >>
> >> Steve, is this something that started happening recently? does the
> >> 4.6-rc3 tag suffer from the same phenomenon?
> >
> > I'll try and reproduce this on the older code, but the keep-alive timer
fired
> > for some other reason,
>
> My assumption was that it fired because it didn't get a keep-alive from
> the host which is exactly what it's supposed to do?
>
Yes, in the original email I started this thread with, I show that on the host,
2 cpus were stuck, and I surmise that the host node was stuck NVMF-wise and thus
the target timer kicked and crashed the target.
> > so I'm not sure the target side keep-alive has been
> > tested until now.
>
> I tested it, and IIRC the original patch had Ming's tested-by tag.
>
How did you test it?
> > But it is easy to test over iWARP, just do this while a heavy
> > fio is running:
> >
> > ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up
>
> So this is related to I/O load then? Does it happen when
> you just do it without any I/O? (or small load)?
I'll try this.
Note there are two sets of crashes discussed in this thread: the one Yoichi saw
on his nodes where the host hung causing the target keep-alive to fire and
crash. That is the crash with stack traces I included in the original email
starting this thread. And then there is a repeatable crash on my setup, which
looks the same, that happens when I bring the interface down long enough to kick
the keep-alive. Since I can reproduce the latter easily I'm continuing with
this debug.
Here is the fio command I use:
fio --bs=1k --time_based --runtime=2000 --numjobs=8 --name=TEST-1k-8g-20-8-32
--direct=1 --iodepth=32 -rw=randread --randrepeat=0 --norandommap --loops=1
--exitall --ioengine=libaio --filename=/dev/nvme1n1
next prev parent reply other threads:[~2016-06-16 20:12 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-16 14:53 target crash / host hang with nvme-all.3 branch of nvme-fabrics Steve Wise
2016-06-16 14:57 ` Christoph Hellwig
2016-06-16 15:10 ` Christoph Hellwig
2016-06-16 15:17 ` Steve Wise
2016-06-16 19:11 ` Sagi Grimberg
2016-06-16 20:38 ` Christoph Hellwig
2016-06-16 21:37 ` Sagi Grimberg
2016-06-16 21:40 ` Sagi Grimberg
2016-06-21 16:01 ` Christoph Hellwig
2016-06-22 10:22 ` Sagi Grimberg
2016-06-16 15:24 ` Steve Wise
2016-06-16 16:41 ` Steve Wise
2016-06-16 15:56 ` Steve Wise
2016-06-16 19:55 ` Sagi Grimberg
2016-06-16 19:59 ` Steve Wise
2016-06-16 20:07 ` Sagi Grimberg
2016-06-16 20:12 ` Steve Wise [this message]
2016-06-16 20:27 ` Ming Lin
2016-06-16 20:28 ` Steve Wise
2016-06-16 20:34 ` 'Christoph Hellwig'
2016-06-16 20:49 ` Steve Wise
2016-06-16 21:06 ` Steve Wise
2016-06-16 21:42 ` Sagi Grimberg
2016-06-16 21:47 ` Ming Lin
2016-06-16 21:53 ` Steve Wise
2016-06-16 21:46 ` Steve Wise
2016-06-27 22:29 ` Ming Lin
2016-06-28 9:14 ` 'Christoph Hellwig'
2016-06-28 14:15 ` Steve Wise
2016-06-28 15:51 ` 'Christoph Hellwig'
2016-06-28 16:31 ` Steve Wise
2016-06-28 16:49 ` Ming Lin
2016-06-28 19:20 ` Steve Wise
2016-06-28 19:43 ` Steve Wise
2016-06-28 21:04 ` Ming Lin
2016-06-29 14:11 ` Steve Wise
2016-06-27 17:26 ` Ming Lin
2016-06-16 20:35 ` Steve Wise
2016-06-16 20:01 ` Steve Wise
2016-06-17 14:05 ` Steve Wise
[not found] ` <005f01d1c8a1$5a229240$0e67b6c0$@opengridcomputing.com>
2016-06-17 14:16 ` Steve Wise
2016-06-17 17:20 ` Ming Lin
2016-06-19 11:57 ` Sagi Grimberg
2016-06-21 14:18 ` Steve Wise
2016-06-21 17:33 ` Ming Lin
2016-06-21 17:59 ` Steve Wise
[not found] ` <006e01d1cbc7$d0d9cc40$728d64c0$@opengridcomputing.com>
2016-06-22 13:42 ` Steve Wise
2016-06-27 14:19 ` Steve Wise
2016-06-28 8:50 ` 'Christoph Hellwig'
2016-07-04 9:57 ` Yoichi Hayakawa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='01b901d1c80b$72f83680$58e8a380$@opengridcomputing.com' \
--to=swise@opengridcomputing.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).