From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <51B8D7C2.3040907@redhat.com>
Date: Wed, 12 Jun 2013 16:19:14 -0400
From: Paolo Bonzini
MIME-Version: 1.0
References: <51B70CF2.1020306@suse.de> <20130612075620.GD946@stefanha-thinkpad.muc.redhat.com>
In-Reply-To: <20130612075620.GD946@stefanha-thinkpad.muc.redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] virtio-scsi and error handling
To: Stefan Hajnoczi
Cc: "qemu-devel@nongnu.org", Asias He, Hannes Reinecke, Alexander Graf

On 12/06/2013 03:56, Stefan Hajnoczi wrote:
> On Tue, Jun 11, 2013 at 01:41:38PM +0200, Hannes Reinecke wrote:
>> I'm currently playing around with improving SCSI EH, optimizing
>> command aborts and the like.
>>
>> And, supposing it to be a nice testbed, tried to make things work
>> with virtio_scsi.
>>
>> However, looking at the code there I've found virtscsi_tmf() just
>> uses 'wait_for_completion', with no timeout specified. So in effect
>> any abort might stall forever.
>>
>> Wouldn't it be more sensible to use 'wait_for_completion_timeout'
>> here, to allow the error escalation to continue?
>> This would especially be useful when running with multipathing,
>> as the underlying device might stall, and aio_cancel() doesn't work
>> reliably, if at all.
>
> Hi,
> I agree that we need a timeout.  bdrv_aio_cancel() is not guaranteed to
> complete in bounded time.

I also agree that we need a timeout, but note that a host reset might
also fail to complete in bounded time if I/O doesn't terminate in the
host.

Last time I checked, the io_cancel system call was basically a no-op
(for aio=native), and for aio=threads the worker might stay in D state
for an unbounded time too.

Paolo

>> Also I've found that there is no host reset. Currently the virtio
>> semantics seem to require reliable communication, ie for every
>> command sent there _has_ to be a response.
>>
>> Long and painful experience with RAID HBAs has shown that this model
>> works okay for the lower-level escalations, but you absolutely need
>> a host reset to restore communication.
>> In the case of virtio I would think that a virtio-level reset for
>> host_reset would be a sensible idea.
>
> One thing to watch out for is that a virtio-scsi reset will likely hang
> too because it resets all pending requests.
>
> Paolo Bonzini has done the lion's share of virtio-scsi work over the
> past year (or two?).  He might have some more thoughts.
>
> Stefan
>
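
The difference between an unbounded wait and a wait with a timeout, as discussed in this thread, can be illustrated with a small userspace sketch. This is only an analogy in Python, not kernel or QEMU code: `threading.Event` stands in for a kernel completion, and the helper names (`wait_for_completion`, `escalate`, `stalled_request`, `completing_request`) and the timeout values are hypothetical. It also shows the caveat raised in the reply: if every recovery step stalls, a timeout lets escalation continue but does not by itself guarantee recovery.

```python
import threading


def wait_for_completion(done, timeout=None):
    """Wait on `done`; return True if it was signalled, False on timeout.

    With timeout=None this blocks until signalled, which is the
    "abort might stall forever" behaviour described in the thread.
    """
    return done.wait(timeout)


def escalate(steps, timeout):
    """Try each (name, issue) recovery step in order.

    `issue()` starts a request and returns an Event that is set when the
    request completes. Returns the name of the first step that completes
    within `timeout` seconds, or None if every step timed out.
    """
    for name, issue in steps:
        done = issue()
        if wait_for_completion(done, timeout):
            return name
    return None


def stalled_request():
    # Hypothetical backend that never responds: the Event is never set.
    return threading.Event()


def completing_request():
    # Hypothetical backend that responds shortly after the request is issued.
    done = threading.Event()
    threading.Timer(0.01, done.set).start()
    return done
```

Calling `escalate([("abort", stalled_request), ("lun_reset", completing_request)], timeout=0.2)` moves past the stalled abort and returns `"lun_reset"`, whereas a list in which every step stalls returns `None` after all timeouts have expired.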