From mboxrd@z Thu Jan 1 00:00:00 1970
From: sagi@grimberg.me (Sagi Grimberg)
Date: Thu, 16 Jun 2016 23:07:10 +0300
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
 <20160616145724.GA32635@infradead.org>
 <017001d1c7e7$95057270$bf105750$@opengridcomputing.com>
 <5763044A.9090206@grimberg.me>
 <01b501d1c809$92cb1a60$b8614f20$@opengridcomputing.com>
Message-ID: <576306EE.4020306@grimberg.me>

>>
>> Umm, I think this might be happening because we get to delete_ctrl when
>> one of our queues has a NULL ctrl. This means that either:
>> 1. we never got a chance to initialize it, or
>> 2. we already freed it.
>>
>> (1) doesn't seem possible as we have a very short window (that we're
>> better off eliminating) between when we start the keep-alive timer (in
>> alloc_ctrl) and the time we assign the sq->ctrl (install_queue).
>>
>> (2) doesn't seem likely either to me at least as from what I followed,
>> delete_ctrl should be mutual exclusive with other deletions, moreover,
>> I didn't see an indication in the logs that any other deletions are
>> happening.
>>
>> Steve, is this something that started happening recently? does the
>> 4.6-rc3 tag suffer from the same phenomenon?
>
> I'll try and reproduce this on the older code, but the keep-alive timer fired
> for some other reason,

My assumption was that it fired because it didn't get a keep-alive from
the host which is exactly what it's supposed to do?

> so I'm not sure the target side keep-alive has been
> tested until now.

I tested it, and IIRC the original patch had Ming's tested-by tag.

> But it is easy to test over iWARP, just do this while a heavy
> fio is running:
>
> ifconfig ethX down; sleep 15; ifconfig ethX / up

So this is related to I/O load then? Does it happen when you just
do it without any I/O? (or small load)?