qemu-devel.nongnu.org archive mirror
* [Qemu-devel] virtio-scsi and error handling
@ 2013-06-11 11:41 Hannes Reinecke
  2013-06-12  7:56 ` Stefan Hajnoczi
  0 siblings, 1 reply; 3+ messages in thread
From: Hannes Reinecke @ 2013-06-11 11:41 UTC
  To: Stefan Hajnoczi; +Cc: Paolo Bonzini, Alexander Graf, qemu-devel@nongnu.org

Hi Stefan,

I am currently playing around with improving SCSI EH, optimizing
command aborts and the like.

And, supposing it to be a nice testbed, I tried to make things work
with virtio_scsi.

However, looking at the code there I've found virtscsi_tmf() just
uses 'wait_for_completion', with no timeout specified. So in effect
any abort might stall forever.

Wouldn't it be more sensible to use 'wait_for_completion_timeout'
here, to allow the error escalation to continue?
This would especially be useful when running with multipathing,
as the underlying device might stall, and aio_cancel() doesn't work
reliably, if at all.
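
As a rough illustration of the difference (a userspace Python sketch, not
the virtio_scsi driver code): threading.Event stands in for the kernel
completion, and the names wait_for_tmf, abort_command and send_tmf are
hypothetical. Waiting with a timeout lets the caller report failure so
that error handling can escalate instead of blocking forever.

```python
import threading


def wait_for_tmf(done: threading.Event, timeout_s: float) -> bool:
    """Return True if the TMF response arrived, False if the wait timed out."""
    # Event.wait() with no timeout would block forever, like
    # wait_for_completion; passing a timeout bounds the wait.
    return done.wait(timeout=timeout_s)


def abort_command(send_tmf, timeout_s: float = 0.1) -> str:
    """Issue an abort and decide whether error handling must escalate.

    send_tmf is a hypothetical callable that queues the task management
    request; the backend is expected to set the event when it responds.
    """
    done = threading.Event()
    send_tmf(done)
    if wait_for_tmf(done, timeout_s):
        return "aborted"
    # No response within the timeout: let the next escalation level run.
    return "escalate"
```

A real driver would also have to cope with a response that arrives after
the timeout has expired, which this sketch ignores.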

Also I've found that there is no host reset. Currently the virtio
semantics seem to require reliable communication, i.e. for every
command sent there _has_ to be a response.

Long and painful experience with RAID HBAs has shown that this model
works okay for the lower-level escalations, but you absolutely need
a host reset to restore communication.
In the case of virtio I would think that a virtio-level reset for
host_reset would be a sensible idea.
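
The escalation argument can be sketched as a ladder of recovery steps
tried in order, with host reset as the last resort. This is an
illustrative Python sketch only; the escalate helper and the step names
are assumptions, not the SCSI midlayer API.

```python
from typing import Callable, Sequence, Tuple

# A recovery step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def escalate(steps: Sequence[Step]) -> str:
    """Try each recovery step in order; return the name of the first success.

    If every step fails, including the last-resort one, report "offline".
    """
    for name, attempt in steps:
        if attempt():
            return name
    return "offline"
```

With steps such as ("abort", ...), ("device_reset", ...) and
("host_reset", ...), a ladder that lacks the final host_reset entry has
nothing left to try once the lower levels fail.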

Any opinions from your side?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: [Qemu-devel] virtio-scsi and error handling
  2013-06-11 11:41 [Qemu-devel] virtio-scsi and error handling Hannes Reinecke
@ 2013-06-12  7:56 ` Stefan Hajnoczi
  2013-06-12 20:19   ` Paolo Bonzini
  0 siblings, 1 reply; 3+ messages in thread
From: Stefan Hajnoczi @ 2013-06-12  7:56 UTC
  To: Hannes Reinecke
  Cc: Paolo Bonzini, Asias He, Alexander Graf, qemu-devel@nongnu.org

On Tue, Jun 11, 2013 at 01:41:38PM +0200, Hannes Reinecke wrote:
> I am currently playing around with improving SCSI EH, optimizing
> command aborts and the like.
> 
> And, supposing it to be a nice testbed, I tried to make things work
> with virtio_scsi.
> 
> However, looking at the code there I've found virtscsi_tmf() just
> uses 'wait_for_completion', with no timeout specified. So in effect
> any abort might stall forever.
> 
> Wouldn't it be more sensible to use 'wait_for_completion_timeout'
> here, to allow the error escalation to continue?
> This would especially be useful when running with multipathing,
> as the underlying device might stall, and aio_cancel() doesn't work
> reliably, if at all.

Hi,
I agree that we need a timeout.  bdrv_aio_cancel() is not guaranteed to
complete in bounded time.

> Also I've found that there is no host reset. Currently the virtio
> semantics seem to require reliable communication, i.e. for every
> command sent there _has_ to be a response.
> 
> Long and painful experience with RAID HBAs has shown that this model
> works okay for the lower-level escalations, but you absolutely need
> a host reset to restore communication.
> In the case of virtio I would think that a virtio-level reset for
> host_reset would be a sensible idea.

One thing to watch out for is that a virtio-scsi reset will likely hang
too because it resets all pending requests.

Paolo Bonzini has done the lion's share of virtio-scsi work over the
past year (or two?).  He might have some more thoughts.

Stefan


* Re: [Qemu-devel] virtio-scsi and error handling
  2013-06-12  7:56 ` Stefan Hajnoczi
@ 2013-06-12 20:19   ` Paolo Bonzini
  0 siblings, 0 replies; 3+ messages in thread
From: Paolo Bonzini @ 2013-06-12 20:19 UTC
  To: Stefan Hajnoczi
  Cc: qemu-devel@nongnu.org, Asias He, Hannes Reinecke, Alexander Graf

On 12/06/2013 03:56, Stefan Hajnoczi wrote:
> On Tue, Jun 11, 2013 at 01:41:38PM +0200, Hannes Reinecke wrote:
>> I am currently playing around with improving SCSI EH, optimizing
>> command aborts and the like.
>>
>> And, supposing it to be a nice testbed, I tried to make things work
>> with virtio_scsi.
>>
>> However, looking at the code there I've found virtscsi_tmf() just
>> uses 'wait_for_completion', with no timeout specified. So in effect
>> any abort might stall forever.
>>
>> Wouldn't it be more sensible to use 'wait_for_completion_timeout'
>> here, to allow the error escalation to continue?
>> This would especially be useful when running with multipathing,
>> as the underlying device might stall, and aio_cancel() doesn't work
>> reliably, if at all.
> 
> Hi,
> I agree that we need a timeout.  bdrv_aio_cancel() is not guaranteed to
> complete in bounded time.

I also agree that we need a timeout, but note that a host reset could
also fail to complete in bounded time if I/O doesn't terminate in the host.

Last time I checked the io_cancel system call was basically a no-op (for
aio=native), and for aio=threads the worker might stay in D state for an
unbounded time too.
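
A loose userspace analogy in Python (not QEMU code; try_cancel and
stuck_io are hypothetical names): once a worker thread is already
executing a blocking call, a cancel request cannot stop it, so the
caller can only bound its own wait.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def try_cancel(timeout_s: float = 0.05):
    """Return (cancelled, timed_out) for a request stuck in a worker thread."""
    started = threading.Event()
    release = threading.Event()  # stands in for the stalled device finally answering

    def stuck_io():
        started.set()
        release.wait()  # blocks, loosely like a worker stuck in D state
        return "completed"

    pool = ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(stuck_io)
    started.wait()  # make sure the request is in flight

    # Future.cancel() returns False for a call that is already running.
    cancelled = fut.cancel()

    try:
        fut.result(timeout=timeout_s)
        timed_out = False
    except FutureTimeout:
        timed_out = True  # the caller gave up; the worker is still blocked

    release.set()  # unblock the worker so the pool can shut down cleanly
    pool.shutdown(wait=True)
    return cancelled, timed_out
```

Here the cancel attempt has no effect and the bounded wait times out,
while the worker stays blocked until the underlying I/O returns.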

Paolo

>> Also I've found that there is no host reset. Currently the virtio
>> semantics seem to require reliable communication, i.e. for every
>> command sent there _has_ to be a response.
>>
>> Long and painful experience with RAID HBAs has shown that this model
>> works okay for the lower-level escalations, but you absolutely need
>> a host reset to restore communication.
>> In the case of virtio I would think that a virtio-level reset for
>> host_reset would be a sensible idea.
> 
> One thing to watch out for is that a virtio-scsi reset will likely hang
> too because it resets all pending requests.
> 
> Paolo Bonzini has done the lion's share of virtio-scsi work over the
> past year (or two?).  He might have some more thoughts.
> 
> Stefan
> 
