From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33569)
	by lists.gnu.org with esmtp (Exim 4.71) id 1Umfur-0004xm-H2
	for qemu-devel@nongnu.org; Wed, 12 Jun 2013 03:56:27 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	id 1Umfuq-0002VE-6E
	for qemu-devel@nongnu.org; Wed, 12 Jun 2013 03:56:25 -0400
Received: from mail-ea0-x236.google.com ([2a00:1450:4013:c01::236]:64335)
	by eggs.gnu.org with esmtp (Exim 4.71) id 1Umfup-0002V7-Vu
	for qemu-devel@nongnu.org; Wed, 12 Jun 2013 03:56:24 -0400
Received: by mail-ea0-f182.google.com with SMTP id d10so2924781eaj.27;
	Wed, 12 Jun 2013 00:56:23 -0700 (PDT)
Date: Wed, 12 Jun 2013 09:56:20 +0200
From: Stefan Hajnoczi
Message-ID: <20130612075620.GD946@stefanha-thinkpad.muc.redhat.com>
References: <51B70CF2.1020306@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51B70CF2.1020306@suse.de>
Subject: Re: [Qemu-devel] virtio-scsi and error handling
To: Hannes Reinecke
Cc: Paolo Bonzini, Asias He, Alexander Graf, "qemu-devel@nongnu.org"

On Tue, Jun 11, 2013 at 01:41:38PM +0200, Hannes Reinecke wrote:
> I currently playing around with improving SCSI EH, optimizing
> command aborts and the like.
>
> And, supposing it to be a nice testbed, tried to make things work
> with virtio_scsi.
>
> However, looking at the code there I've found virtscsi_tmf() just
> uses 'wait_for_completion', with no timeout specified. So in effect
> any abort might stall forever.
>
> Wouldn't it be more sensible to use 'wait_for_completion_timeout'
> here, to allow the error escalation to continue?
> This would especially be useful when running with multipathing,
> as the underlying device might stall, and aio_cancel() doesn't work
> reliably, if at all.

Hi,
I agree that we need a timeout.
bdrv_aio_cancel() is not guaranteed to complete in bounded time.

> Also I've found that there is no host reset. Currently the virtio
> semantics seem to require reliable communication, ie for every
> command send there _has_ to be a response.
>
> Long and painful experience with RAID HBAs has shown that this model
> works okay for the lower-level escalations, but you absolutely need
> a host reset to restore communication.
> In the case of virtio I would think that a virtio-level reset for
> host_reset would be a sensible idea.

One thing to watch out for is that a virtio-scsi reset will likely hang
too because it resets all pending requests.

Paolo Bonzini has done the lion's share of virtio-scsi work over the
past year (or two?). He might have some more thoughts.

Stefan
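
To illustrate the difference between an unbounded wait and a bounded
one, here is a minimal userspace sketch in C. It is not the
virtio_scsi driver code: struct completion, completion_signal(),
completion_wait_timeout(), tmf_abort() and the tmf_result values are
hypothetical stand-ins built on POSIX threads, chosen only to mirror
the shape of the kernel's wait_for_completion_timeout(). The point is
that the caller gets control back when no response arrives, so error
escalation can continue.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Userspace stand-in for the kernel's struct completion (assumption). */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static void completion_init(struct completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Called by whoever delivers the response to the request. */
static void completion_signal(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Wait at most timeout_ms; true if completed, false if the wait gave up. */
static bool completion_wait_timeout(struct completion *c, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;
    bool done;

    /* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    /* Loop to absorb spurious wakeups; stop on timeout or any error. */
    while (!c->done && rc == 0)
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
    done = c->done;
    pthread_mutex_unlock(&c->lock);
    return done;
}

enum tmf_result { TMF_COMPLETED, TMF_TIMED_OUT };

/* Hypothetical abort path: report a timeout instead of stalling forever. */
static enum tmf_result tmf_abort(struct completion *c, long timeout_ms)
{
    return completion_wait_timeout(c, timeout_ms) ? TMF_COMPLETED
                                                  : TMF_TIMED_OUT;
}
```

A caller that sees TMF_TIMED_OUT can move on to the next escalation
level. As noted above, that next level must not itself depend on the
stuck requests completing, or it will hang in the same way.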