From: swise@opengridcomputing.com (Steve Wise)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Thu, 16 Jun 2016 15:28:06 -0500
Message-ID: <01c101d1c80d$96d13c80$c473b580$@opengridcomputing.com>
In-Reply-To: <CAF1ivSYUG4c7Ej-gNqA=aPFR2zkNq8KhBoodhp64wdY=eQLx6g@mail.gmail.com>
> On Thu, Jun 16, 2016 at 1:12 PM, Steve Wise <swise at opengridcomputing.com>
> wrote:
> >
> >
> >> >>
> >> >> Umm, I think this might be happening because we get to delete_ctrl when
> >> >> one of our queues has a NULL ctrl. This means that either:
> >> >> 1. we never got a chance to initialize it, or
> >> >> 2. we already freed it.
> >> >>
> >> >> (1) doesn't seem possible as we have a very short window (that we're
> >> >> better off eliminating) between when we start the keep-alive timer (in
> >> >> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
> >> >>
> >> >> (2) doesn't seem likely either, to me at least, as from what I followed,
> >> >> delete_ctrl should be mutually exclusive with other deletions; moreover,
> >> >> I didn't see an indication in the logs that any other deletions are
> >> >> happening.
> >> >>
> >> >> Steve, is this something that started happening recently? Does the
> >> >> 4.6-rc3 tag suffer from the same phenomenon?
> >> >
> >> > I'll try and reproduce this on the older code, but the keep-alive timer
> >> > fired for some other reason,
> >>
> >> My assumption was that it fired because it didn't get a keep-alive from
> >> the host which is exactly what it's supposed to do?
> >>
> >
> > Yes, in the original email I started this thread with, I show that on the host,
> > 2 cpus were stuck, and I surmise that the host node was stuck NVMF-wise and thus
> > the target timer kicked and crashed the target.
> >
> >> > so I'm not sure the target side keep-alive has been
> >> > tested until now.
> >>
> >> I tested it, and IIRC the original patch had Ming's tested-by tag.
> >>
> >
> > How did you test it?
> >
> >> > But it is easy to test over iWARP, just do this while a heavy
> >> > fio is running:
> >> >
> >> > ifconfig ethX down; sleep 15; ifconfig ethX <ipaddr>/<mask> up
> >>
> >> So this is related to I/O load then? Does it happen when
> >> you just do it without any I/O? (or small load)?
> >
> > I'll try this.
> >
> > Note there are two sets of crashes discussed in this thread: the one Yoichi saw
> > on his nodes where the host hung causing the target keep-alive to fire and
> > crash. That is the crash with stack traces I included in the original email
> > starting this thread. And then there is a repeatable crash on my setup, which
> > looks the same, that happens when I bring the interface down long enough to kick
> > the keep-alive. Since I can reproduce the latter easily I'm continuing with
> > this debug.
> >
> > Here is the fio command I use:
> >
> > fio --bs=1k --time_based --runtime=2000 --numjobs=8 --name=TEST-1k-8g-20-8-32
> > --direct=1 --iodepth=32 -rw=randread --randrepeat=0 --norandommap --loops=1
> > --exitall --ioengine=libaio --filename=/dev/nvme1n1
>
> Hi Steve,
>
> Just to follow up, does Christoph's patch fix the crash?
It does.