From mboxrd@z Thu Jan  1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 16 Jun 2016 15:28:06 -0500
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To:
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
	<20160616145724.GA32635@infradead.org>
	<017001d1c7e7$95057270$bf105750$@opengridcomputing.com>
	<5763044A.9090206@grimberg.me>
	<01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>
	<576306EE.4020306@grimberg.me>
	<01b901d1c80b$72f83680$58e8a380$@opengridcomputing.com>
Message-ID: <01c101d1c80d$96d13c80$c473b580$@opengridcomputing.com>

> On Thu, Jun 16, 2016 at 1:12 PM, Steve Wise
> wrote:
> >
> >> >>
> >> >> Umm, I think this might be happening because we get to delete_ctrl when
> >> >> one of our queues has a NULL ctrl. This means that either:
> >> >> 1. we never got a chance to initialize it, or
> >> >> 2. we already freed it.
> >> >>
> >> >> (1) doesn't seem possible as we have a very short window (that we're
> >> >> better off eliminating) between when we start the keep-alive timer (in
> >> >> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
> >> >>
> >> >> (2) doesn't seem likely either, to me at least; from what I followed,
> >> >> delete_ctrl should be mutually exclusive with other deletions. Moreover,
> >> >> I didn't see an indication in the logs that any other deletions are
> >> >> happening.
> >> >>
> >> >> Steve, is this something that started happening recently? Does the
> >> >> 4.6-rc3 tag suffer from the same phenomenon?
> >> >
> >> > I'll try and reproduce this on the older code, but the keep-alive timer
> >> > fired for some other reason,
> >>
> >> My assumption was that it fired because it didn't get a keep-alive from
> >> the host, which is exactly what it's supposed to do?
> >>
> >
> > Yes, in the original email I started this thread with, I show that on the host,
> > 2 cpus were stuck, and I surmise that the host node was stuck NVMF-wise and thus
> > the target timer kicked and crashed the target.
> >
> >> > so I'm not sure the target side keep-alive has been
> >> > tested until now.
> >>
> >> I tested it, and IIRC the original patch had Ming's tested-by tag.
> >>
> >
> > How did you test it?
> >
> >> > But it is easy to test over iWARP; just do this while a heavy
> >> > fio is running:
> >> >
> >> > ifconfig ethX down; sleep 15; ifconfig ethX up
> >>
> >> So this is related to I/O load then? Does it happen when
> >> you just do it without any I/O? (or small load)?
> >
> > I'll try this.
> >
> > Note there are two sets of crashes discussed in this thread: the one Yoichi saw
> > on his nodes, where the host hung, causing the target keep-alive to fire and
> > crash. That is the crash with stack traces I included in the original email
> > starting this thread. And then there is a repeatable crash on my setup, which
> > looks the same, that happens when I bring the interface down long enough to kick
> > the keep-alive. Since I can reproduce the latter easily, I'm continuing with
> > this debug.
> >
> > Here is the fio command I use:
> >
> > fio --bs=1k --time_based --runtime=2000 --numjobs=8 --name=TEST-1k-8g-20-8-32
> > --direct=1 --iodepth=32 --rw=randread --randrepeat=0 --norandommap --loops=1
> > --exitall --ioengine=libaio --filename=/dev/nvme1n1
>
> Hi Steve,
>
> Just to follow, does Christoph's patch fix the crash?

It does.
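
For reference, the reproduction described above amounts to the following
(a sketch only: ethX and /dev/nvme1n1 are the placeholder interface and
namespace names used in the thread, and the 15-second outage just needs to
be long enough to trip the target-side keep-alive):

#!/bin/bash
# Start a heavy fio load against the attached NVMF namespace in the background.
fio --bs=1k --time_based --runtime=2000 --numjobs=8 --name=TEST-1k-8g-20-8-32 \
    --direct=1 --iodepth=32 --rw=randread --randrepeat=0 --norandommap \
    --loops=1 --exitall --ioengine=libaio --filename=/dev/nvme1n1 &
FIO_PID=$!

# Bounce the iWARP interface for longer than the keep-alive timeout so the
# target's keep-alive timer fires while I/O is outstanding.
ifconfig ethX down
sleep 15
ifconfig ethX up

# Wait for fio to finish (or fail) after the interface comes back.
wait $FIO_PID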