[Qemu-devel] virtio device error reporting best practice?

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] virtio device error reporting best practice?
@ 2014-03-17  6:02 Dave Airlie
  2014-03-17 14:28 ` Laszlo Ersek
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Dave Airlie @ 2014-03-17  6:02 UTC (permalink / raw)
  To: qemu-devel@nongnu.org

So I'm looking at how best to do virtio gpu device error reporting,
and how to deal with illegal stuff,

I've two levels of errors I want to support,

a) unrecoverable or bad guest kernel programming errors,

b) per 3D context errors from the renderer backend,

(b) I can easily report in an event queue and the guest kernel can in
theory blow away the offenders, this is how GL works with some
extensions,

For (a) I can expect a response from every command I put into the main
GPU control queue, the response should always be no error, but in some
cases it will be because the guest hit some host resource error, or
asked for something insane, (guest kernel drivers would be broken in
most of these cases).

Alternately I can use the separate event queue to send async errors
when the guest does something bad,

I'm also considering adding some sort of flag in config space saying
the device needs a reset before it will continue doing anything,

The main reason I'm considering this stuff is for security reasons if
the guest asks for something really illegal or crazy what should the
expected behaviour of the host be? (at least secure I know that).

Dave.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17  6:02 [Qemu-devel] virtio device error reporting best practice? Dave Airlie
@ 2014-03-17 14:28 ` Laszlo Ersek
  2014-03-17 14:40   ` Peter Maydell
  2014-03-26 12:49   ` Stefan Hajnoczi
  2014-03-17 14:50 ` Gerd Hoffmann
  2014-03-19  0:34 ` Rusty Russell
  2 siblings, 2 replies; 21+ messages in thread
From: Laszlo Ersek @ 2014-03-17 14:28 UTC (permalink / raw)
  To: Dave Airlie, qemu-devel@nongnu.org; +Cc: Seiji Aguchi

On 03/17/14 07:02, Dave Airlie wrote:
> So I'm looking at how best to do virtio gpu device error reporting,
> and how to deal with illegal stuff,
> 
> I've two levels of errors I want to support,
> 
> a) unrecoverable or bad guest kernel programming errors,
> 
> b) per 3D context errors from the renderer backend,
> 
> (b) I can easily report in an event queue and the guest kernel can in
> theory blow away the offenders, this is how GL works with some
> extensions,
> 
> For (a) I can expect a response from every command I put into the main
> GPU control queue, the response should always be no error, but in some
> cases it will be because the guest hit some host resource error, or
> asked for something insane, (guest kernel drivers would be broken in
> most of these cases).
> 
> Alternately I can use the separate event queue to send async errors
> when the guest does something bad,
> 
> I'm also considering adding some sort of flag in config space saying
> the device needs a reset before it will continue doing anything,
> 
> The main reason I'm considering this stuff is for security reasons if
> the guest asks for something really illegal or crazy what should the
> expected behaviour of the host be? (at least secure I know that).

exit(1).

If you grep qemu for it, you'll find such examples. Notably,
"hw/virtio/virtio.c" is chock full of them; if the guest doesn't speak
the basic protocol, there's nothing for the host to do. See also
virtio-blk.c (missing or incorrect headers), virtio-net.c (similar
protocol violations), virtio-scsi.c (wrong header size, bad config etc).

For later, we have a use case on the horizon where all such exits -- not
just virtio, but exit(1) or abort() on invalid guest behavior in general
-- should be optionally coupled (dependent on the qemu command line)
with an automatic dump-guest-memory, in order to help debugging the guest.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:28 ` Laszlo Ersek
@ 2014-03-17 14:40   ` Peter Maydell
  2014-03-17 14:49     ` Laszlo Ersek
  2014-03-17 14:57     ` Richard W.M. Jones
  2014-03-26 12:49   ` Stefan Hajnoczi
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Maydell @ 2014-03-17 14:40 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
> On 03/17/14 07:02, Dave Airlie wrote:
>> The main reason I'm considering this stuff is for security reasons if
>> the guest asks for something really illegal or crazy what should the
>> expected behaviour of the host be? (at least secure I know that).
>
> exit(1).

No thanks -- the guest should never be able to cause QEMU
to exit (in an ideal world). Use
   qemu_log_mask(LOG_GUEST_ERROR, ...)
and continue.

> If you grep qemu for it, you'll find such examples. Notably,
> "hw/virtio/virtio.c" is chock full of them; if the guest doesn't speak
> the basic protocol, there's nothing for the host to do. See also
> virtio-blk.c (missing or incorrect headers), virtio-net.c (similar
> protocol violations), virtio-scsi.c (wrong header size, bad config etc).

I think these are all examples of legacy code written before we
had a sensible logging API for this kind of thing.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:40   ` Peter Maydell
@ 2014-03-17 14:49     ` Laszlo Ersek
  2014-03-17 14:54       ` Peter Maydell
                         ` (3 more replies)
  2014-03-17 14:57     ` Richard W.M. Jones
  1 sibling, 4 replies; 21+ messages in thread
From: Laszlo Ersek @ 2014-03-17 14:49 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On 03/17/14 15:40, Peter Maydell wrote:
> On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
>> On 03/17/14 07:02, Dave Airlie wrote:
>>> The main reason I'm considering this stuff is for security reasons if
>>> the guest asks for something really illegal or crazy what should the
>>> expected behaviour of the host be? (at least secure I know that).
>>
>> exit(1).
> 
> No thanks -- the guest should never be able to cause QEMU
> to exit (in an ideal world). Use
>    qemu_log_mask(LOG_GUEST_ERROR, ...)
> and continue.

How do you continue with a garbled virtio ring? Say you detect an error
that would cause integer overflow or buffer overflow in the host, a
clear virtio protocol violation etc. Error reporting is just one thing
-- what are the semantics of continuing?

Exit(1) is considered guest crash. "The guest should never be able to
cause QEMU to exit" is identical to "a bare metal kernel should never be
able to crash a physical machine, even if it wants to". Some behavior is
left undefined even in hardware manuals. Crashing as early as possible
is safe and helpful.

Just my opinion, naturally.

Thanks,
Laszlo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17  6:02 [Qemu-devel] virtio device error reporting best practice? Dave Airlie
  2014-03-17 14:28 ` Laszlo Ersek
@ 2014-03-17 14:50 ` Gerd Hoffmann
  2014-03-19  0:34 ` Rusty Russell
  2 siblings, 0 replies; 21+ messages in thread
From: Gerd Hoffmann @ 2014-03-17 14:50 UTC (permalink / raw)
  To: Dave Airlie; +Cc: qemu-devel@nongnu.org

On Mo, 2014-03-17 at 16:02 +1000, Dave Airlie wrote:
> So I'm looking at how best to do virtio gpu device error reporting,
> and how to deal with illegal stuff,
> 
> I've two levels of errors I want to support,
> 
> a) unrecoverable or bad guest kernel programming errors,
> 
> b) per 3D context errors from the renderer backend,

What you find in modern real hardware (xhci for example) and which would
also make sense for virtio is:

 * stick an error message into a event queue (for non-fatal errors),
   which would be a good fit for (b)
 * set a bit in a error status register, maybe raise IRQ, stop
   processing until reset (for fatal errors, i.e. your (a) case).
   such a register can live in config space for virtio.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:49     ` Laszlo Ersek
@ 2014-03-17 14:54       ` Peter Maydell
  2014-03-17 14:57       ` Gerd Hoffmann
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Maydell @ 2014-03-17 14:54 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On 17 March 2014 14:49, Laszlo Ersek <lersek@redhat.com> wrote:
> On 03/17/14 15:40, Peter Maydell wrote:
>> On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
>>> On 03/17/14 07:02, Dave Airlie wrote:
>>>> The main reason I'm considering this stuff is for security reasons if
>>>> the guest asks for something really illegal or crazy what should the
>>>> expected behaviour of the host be? (at least secure I know that).
>>>
>>> exit(1).
>>
>> No thanks -- the guest should never be able to cause QEMU
>> to exit (in an ideal world). Use
>>    qemu_log_mask(LOG_GUEST_ERROR, ...)
>> and continue.
>
> How do you continue with a garbled virtio ring? Say you detect an error
> that would cause integer overflow or buffer overflow in the host, a
> clear virtio protocol violation etc. Error reporting is just one thing
> -- what are the semantics of continuing?

Whatever the device likes. If you're emulating real hardware
then the ideal is to match that. For virtio, whatever seems
easiest. Something that would allow the guest to work after
it does a warm reboot would be nice.

> Exit(1) is considered guest crash. "The guest should never be able to
> cause QEMU to exit" is identical to "a bare metal kernel should never be
> able to crash a physical machine, even if it wants to". Some behavior is
> left undefined even in hardware manuals. Crashing as early as possible
> is safe and helpful.

This would mean we couldn't model systems with a hardware
watchdog timer that looks out for guest crashes and does a reboot,
for instance. exit() is approximately equivalent to "machine
caught fire" since it means that there's no resurrecting
the VM via watchdog device, forcing reset through a management
interface or anything else.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:49     ` Laszlo Ersek
  2014-03-17 14:54       ` Peter Maydell
@ 2014-03-17 14:57       ` Gerd Hoffmann
  2014-03-17 19:05       ` Andreas Färber
  2014-03-18 12:45       ` Kevin Wolf
  3 siblings, 0 replies; 21+ messages in thread
From: Gerd Hoffmann @ 2014-03-17 14:57 UTC (permalink / raw)
  To: Laszlo Ersek, Michael S. Tsirkin
  Cc: Peter Maydell, Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On Mo, 2014-03-17 at 15:49 +0100, Laszlo Ersek wrote:
> On 03/17/14 15:40, Peter Maydell wrote:
> > On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
> >> On 03/17/14 07:02, Dave Airlie wrote:
> >>> The main reason I'm considering this stuff is for security reasons if
> >>> the guest asks for something really illegal or crazy what should the
> >>> expected behaviour of the host be? (at least secure I know that).
> >>
> >> exit(1).
> > 
> > No thanks -- the guest should never be able to cause QEMU
> > to exit (in an ideal world). Use
> >    qemu_log_mask(LOG_GUEST_ERROR, ...)
> > and continue.
> 
> How do you continue with a garbled virtio ring? Say you detect an error
> that would cause integer overflow or buffer overflow in the host, a
> clear virtio protocol violation etc. Error reporting is just one thing
> -- what are the semantics of continuing?

Stop processing until device is reset.  This is what real hardware does,
and there are places in qemu (xhci emulation for example) which does
this too.

We don't have a standard error register in virtio to report this to the
guest though.  Maybe something to consider for virtio 1.0?  mst?

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:40   ` Peter Maydell
  2014-03-17 14:49     ` Laszlo Ersek
@ 2014-03-17 14:57     ` Richard W.M. Jones
  2014-03-17 14:59       ` Richard W.M. Jones
  1 sibling, 1 reply; 21+ messages in thread
From: Richard W.M. Jones @ 2014-03-17 14:57 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Dave Airlie, Laszlo Ersek, qemu-devel@nongnu.org, Seiji Aguchi


On Mon, Mar 17, 2014 at 02:40:09PM +0000, Peter Maydell wrote:
> On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
> > On 03/17/14 07:02, Dave Airlie wrote:
> >> The main reason I'm considering this stuff is for security reasons if
> >> the guest asks for something really illegal or crazy what should the
> >> expected behaviour of the host be? (at least secure I know that).
> >
> > exit(1).
> 
> No thanks -- the guest should never be able to cause QEMU
> to exit (in an ideal world). Use
>    qemu_log_mask(LOG_GUEST_ERROR, ...)
> and continue.

Don't look too closely at the spice backend ...

https://bugzilla.redhat.com/buglist.cgi?component=spice&list_id=2320267

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:57     ` Richard W.M. Jones
@ 2014-03-17 14:59       ` Richard W.M. Jones
  0 siblings, 0 replies; 21+ messages in thread
From: Richard W.M. Jones @ 2014-03-17 14:59 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Laszlo Ersek, Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On Mon, Mar 17, 2014 at 02:57:41PM +0000, Richard W.M. Jones wrote:
> 
> On Mon, Mar 17, 2014 at 02:40:09PM +0000, Peter Maydell wrote:
> > On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
> > > On 03/17/14 07:02, Dave Airlie wrote:
> > >> The main reason I'm considering this stuff is for security reasons if
> > >> the guest asks for something really illegal or crazy what should the
> > >> expected behaviour of the host be? (at least secure I know that).
> > >
> > > exit(1).
> > 
> > No thanks -- the guest should never be able to cause QEMU
> > to exit (in an ideal world). Use
> >    qemu_log_mask(LOG_GUEST_ERROR, ...)
> > and continue.
> 
> Don't look too closely at the spice backend ...
> 
> https://bugzilla.redhat.com/buglist.cgi?component=spice&list_id=2320267

A better link might be:
https://bugzilla.redhat.com/show_bug.cgi?id=997932#c13

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:49     ` Laszlo Ersek
  2014-03-17 14:54       ` Peter Maydell
  2014-03-17 14:57       ` Gerd Hoffmann
@ 2014-03-17 19:05       ` Andreas Färber
  2014-03-18 12:45       ` Kevin Wolf
  3 siblings, 0 replies; 21+ messages in thread
From: Andreas Färber @ 2014-03-17 19:05 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Peter Maydell, Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

Am 17.03.2014 15:49, schrieb Laszlo Ersek:
> On 03/17/14 15:40, Peter Maydell wrote:
>> On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
>>> On 03/17/14 07:02, Dave Airlie wrote:
>>>> The main reason I'm considering this stuff is for security reasons if
>>>> the guest asks for something really illegal or crazy what should the
>>>> expected behaviour of the host be? (at least secure I know that).
>>>
>>> exit(1).
>>
>> No thanks -- the guest should never be able to cause QEMU
>> to exit (in an ideal world). Use
>>    qemu_log_mask(LOG_GUEST_ERROR, ...)
>> and continue.
> 
> How do you continue with a garbled virtio ring? Say you detect an error
> that would cause integer overflow or buffer overflow in the host, a
> clear virtio protocol violation etc. Error reporting is just one thing
> -- what are the semantics of continuing?
> 
> Exit(1) is considered guest crash.

I disagree. QEMU has a panic state/event for guest crashes, used by
pvpanic and s390x, which does not exit() the process.

If it's a state that the guest should not be able to trigger, then
assertions are better than exit() since they allow to use gdb for
investigating the origin.

exit(1) should correspond to an unrecoverable issue on the host, such as
ioctl failure, not a fault by the guest. And yes, there's quite a few of
them around just like there's still tons of fprintf()s around. Error
handling and reporting is a constant pain point...

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:49     ` Laszlo Ersek
                         ` (2 preceding siblings ...)
  2014-03-17 19:05       ` Andreas Färber
@ 2014-03-18 12:45       ` Kevin Wolf
  3 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2014-03-18 12:45 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Peter Maydell, Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

Am 17.03.2014 um 15:49 hat Laszlo Ersek geschrieben:
> On 03/17/14 15:40, Peter Maydell wrote:
> > On 17 March 2014 14:28, Laszlo Ersek <lersek@redhat.com> wrote:
> >> On 03/17/14 07:02, Dave Airlie wrote:
> >>> The main reason I'm considering this stuff is for security reasons if
> >>> the guest asks for something really illegal or crazy what should the
> >>> expected behaviour of the host be? (at least secure I know that).
> >>
> >> exit(1).
> > 
> > No thanks -- the guest should never be able to cause QEMU
> > to exit (in an ideal world). Use
> >    qemu_log_mask(LOG_GUEST_ERROR, ...)
> > and continue.
> 
> How do you continue with a garbled virtio ring? Say you detect an error
> that would cause integer overflow or buffer overflow in the host, a
> clear virtio protocol violation etc. Error reporting is just one thing
> -- what are the semantics of continuing?

Stop processing the ring until the guest resets it.

You don't want a buggy guest driver to crash the whole system
needlessly, it's good enough if the one device doesn't work any more.

Also, consider that a qemu process can be doing entirely different
things, like executing block jobs or hosting the builtin NBD server (and
being a QMP server, too), and you don't want to allow the guest to
interrupt this.

As far as I can tell, exit() also doesn't do bdrv_close_all(), so it's
an unclean shutdown that may leave images in an inconsistent state.

And this is only the block layer perspective on it. Not sure about how
well other subsystems cope with unclean shutdowns.

Not to forget that a user will be surprised if his qemu process is
suddenly gone without him intentionally shutting down the VM. An error
message indicating the failing device inside the guest is much nicer.

tl;dr: exit() calls in hw/ do exist, but these are longstanding bugs,
not examples to follow.

Kevin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17  6:02 [Qemu-devel] virtio device error reporting best practice? Dave Airlie
  2014-03-17 14:28 ` Laszlo Ersek
  2014-03-17 14:50 ` Gerd Hoffmann
@ 2014-03-19  0:34 ` Rusty Russell
  2014-03-19  8:12   ` Markus Armbruster
  2014-03-20  6:51   ` Michael S. Tsirkin
  2 siblings, 2 replies; 21+ messages in thread
From: Rusty Russell @ 2014-03-19  0:34 UTC (permalink / raw)
  To: Dave Airlie, qemu-devel@nongnu.org

Dave Airlie <airlied@gmail.com> writes:
> So I'm looking at how best to do virtio gpu device error reporting,
> and how to deal with illegal stuff,
>
> I've two levels of errors I want to support,
>
> a) unrecoverable or bad guest kernel programming errors,

The QEMU standard approach is to exit at this point.  No, really.

> b) per 3D context errors from the renderer backend,
>
> (b) I can easily report in an event queue and the guest kernel can in
> theory blow away the offenders, this is how GL works with some
> extensions,

That's probably sanest.

> For (a) I can expect a response from every command I put into the main
> GPU control queue, the response should always be no error, but in some
> cases it will be because the guest hit some host resource error, or
> asked for something insane, (guest kernel drivers would be broken in
> most of these cases).
>
> Alternately I can use the separate event queue to send async errors
> when the guest does something bad,
>
> I'm also considering adding some sort of flag in config space saying
> the device needs a reset before it will continue doing anything,

I generally dislike error codes which Never Happen; it's like making
every void function return int just in case: the caller has no idea what
to do if it fails.

The litmus test: does *your* guest handle failures other than by giving
up on the device?  If so, sure, you need to have a sane error-reporting
strategy.

> The main reason I'm considering this stuff is for security reasons if
> the guest asks for something really illegal or crazy what should the
> expected behaviour of the host be? (at least secure I know that).

If the guest userspace can do it, don't exit.  If the kernel only, and
it's should have known better, abort is OK.

Sure that doesn't help much!
Rusty.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-19  0:34 ` Rusty Russell
@ 2014-03-19  8:12   ` Markus Armbruster
  2014-03-20  3:40     ` Rusty Russell
  2014-03-20  6:51   ` Michael S. Tsirkin
  1 sibling, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2014-03-19  8:12 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Dave Airlie, qemu-devel@nongnu.org

Rusty Russell <rusty@rustcorp.com.au> writes:

> Dave Airlie <airlied@gmail.com> writes:
>> So I'm looking at how best to do virtio gpu device error reporting,
>> and how to deal with illegal stuff,
>>
>> I've two levels of errors I want to support,
>>
>> a) unrecoverable or bad guest kernel programming errors,
>
> The QEMU standard approach is to exit at this point.  No, really.
>
>> b) per 3D context errors from the renderer backend,
>>
>> (b) I can easily report in an event queue and the guest kernel can in
>> theory blow away the offenders, this is how GL works with some
>> extensions,
>
> That's probably sanest.
>
>> For (a) I can expect a response from every command I put into the main
>> GPU control queue, the response should always be no error, but in some
>> cases it will be because the guest hit some host resource error, or
>> asked for something insane, (guest kernel drivers would be broken in
>> most of these cases).
>>
>> Alternately I can use the separate event queue to send async errors
>> when the guest does something bad,
>>
>> I'm also considering adding some sort of flag in config space saying
>> the device needs a reset before it will continue doing anything,
>
> I generally dislike error codes which Never Happen; it's like making
> every void function return int just in case: the caller has no idea what
> to do if it fails.
>
> The litmus test: does *your* guest handle failures other than by giving
> up on the device?  If so, sure, you need to have a sane error-reporting
> strategy.

Err, isn't this a circular argument?  No need for QEMU to report the
failure, because the guest won't handle it; no need to handle the
failure, because QEMU won't report it.

What about this: would you make your guest handle failures if they were
reported?

>> The main reason I'm considering this stuff is for security reasons if
>> the guest asks for something really illegal or crazy what should the
>> expected behaviour of the host be? (at least secure I know that).
>
> If the guest userspace can do it, don't exit.  If the kernel only, and
> it's should have known better, abort is OK.
>
> Sure that doesn't help much!

Immediate exit() or abort() denies the guest the ability to degrade
service gracefully (disable the device, cry for help and try to hobble
on), or report its brokenness ungracefully (kernel panic, crash dump).
I doubt denying that is okay unless the device is so important that
without it you can't even hope to panic.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-19  8:12   ` Markus Armbruster
@ 2014-03-20  3:40     ` Rusty Russell
  2014-03-20  6:39       ` Markus Armbruster
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2014-03-20  3:40 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Dave Airlie, qemu-devel@nongnu.org

Markus Armbruster <armbru@redhat.com> writes:
> Rusty Russell <rusty@rustcorp.com.au> writes:
>> The litmus test: does *your* guest handle failures other than by giving
>> up on the device?  If so, sure, you need to have a sane error-reporting
>> strategy.
>
> Err, isn't this a circular argument?  No need for QEMU to report the
> failure, because the guest won't handle it; no need to handle the
> failure, because QEMU won't report it.
>
> What about this: would you make your guest handle failures if they were
> reported?

Perhaps I was unclear, that's what I meant.

>>> The main reason I'm considering this stuff is for security reasons if
>>> the guest asks for something really illegal or crazy what should the
>>> expected behaviour of the host be? (at least secure I know that).
>>
>> If the guest userspace can do it, don't exit.  If the kernel only, and
>> it's should have known better, abort is OK.
>>
>> Sure that doesn't help much!
>
> Immediate exit() or abort() denies the guest the ability to degrade
> service gracefully (disable the device, cry for help and try to hobble
> on), or report its brokenness ungracefully (kernel panic, crash dump).
> I doubt denying that is okay unless the device is so important that
> without it you can't even hope to panic.

Oh yes, I completely agree with you!  But QEMU practice doesn't :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-20  3:40     ` Rusty Russell
@ 2014-03-20  6:39       ` Markus Armbruster
  2014-03-20 12:53         ` Peter Maydell
  0 siblings, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2014-03-20  6:39 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Dave Airlie, qemu-devel@nongnu.org

Rusty Russell <rusty@rustcorp.com.au> writes:

> Markus Armbruster <armbru@redhat.com> writes:
>> Rusty Russell <rusty@rustcorp.com.au> writes:
>>> The litmus test: does *your* guest handle failures other than by giving
>>> up on the device?  If so, sure, you need to have a sane error-reporting
>>> strategy.
>>
>> Err, isn't this a circular argument?  No need for QEMU to report the
>> failure, because the guest won't handle it; no need to handle the
>> failure, because QEMU won't report it.
>>
>> What about this: would you make your guest handle failures if they were
>> reported?
>
> Perhaps I was unclear, that's what I meant.
>
>>>> The main reason I'm considering this stuff is for security reasons if
>>>> the guest asks for something really illegal or crazy what should the
>>>> expected behaviour of the host be? (at least secure I know that).
>>>
>>> If the guest userspace can do it, don't exit.  If the kernel only, and
>>> it's should have known better, abort is OK.
>>>
>>> Sure that doesn't help much!
>>
>> Immediate exit() or abort() denies the guest the ability to degrade
>> service gracefully (disable the device, cry for help and try to hobble
>> on), or report its brokenness ungracefully (kernel panic, crash dump).
>> I doubt denying that is okay unless the device is so important that
>> without it you can't even hope to panic.
>
> Oh yes, I completely agree with you!  But QEMU practice doesn't :)

Ah, then we're in violent agreement :)

Time to cease the practice.  Will be hard as long as the code
is chock-full of bad examples.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-19  0:34 ` Rusty Russell
  2014-03-19  8:12   ` Markus Armbruster
@ 2014-03-20  6:51   ` Michael S. Tsirkin
  2014-03-21  9:44     ` Yan Vugenfirer
  1 sibling, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2014-03-20  6:51 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Dave Airlie, qemu-devel@nongnu.org

On Wed, Mar 19, 2014 at 11:04:19AM +1030, Rusty Russell wrote:
> Dave Airlie <airlied@gmail.com> writes:
> > So I'm looking at how best to do virtio gpu device error reporting,
> > and how to deal with illegal stuff,
> >
> > I've two levels of errors I want to support,
> >
> > a) unrecoverable or bad guest kernel programming errors,
> 
> The QEMU standard approach is to exit at this point.  No, really.

It's easy on the hypervisor but often not very friendly for driver writers
who might not be qemu experts.
QEMU's moving away from exiting on errors and it would be nice
to have a robust way to report driver bugs.
How about setting VIRTIO_CONFIG_S_DEVICE_FAILED ?

Another idea that windows driver implemented is reporting
failure reason hint. They wrote it out to ISR, specifically
they notified host about watchdog timer expiration for net device
in this way.

> > b) per 3D context errors from the renderer backend,
> >
> > (b) I can easily report in an event queue and the guest kernel can in
> > theory blow away the offenders, this is how GL works with some
> > extensions,
> 
> That's probably sanest.

If it's possible to identify the offenders, I agree
a VQ is better than config space for that.
Need to make sure the queue is big enough to avoid
underrun of that queue though. Is that always possible?

> > GPU control queue, the response should always be no error, but in some
> > cases it will be because the guest hit some host resource error, or
> > asked for something insane, (guest kernel drivers would be broken in
> > most of these cases).
> >
> > Alternately I can use the separate event queue to send async errors
> > when the guest does something bad,
> >
> > I'm also considering adding some sort of flag in config space saying
> > the device needs a reset before it will continue doing anything,
> 
> I generally dislike error codes which Never Happen; it's like making
> every void function return int just in case: the caller has no idea what
> to do if it fails.
> 
> The litmus test: does *your* guest handle failures other than by giving
> up on the device?  If so, sure, you need to have a sane error-reporting
> strategy.

Right but driver development is also a valid need.

> > The main reason I'm considering this stuff is for security reasons if
> > the guest asks for something really illegal or crazy what should the
> > expected behaviour of the host be? (at least secure I know that).
> 
> If the guest userspace can do it, don't exit.  If the kernel only, and
> it's should have known better, abort is OK.

I second that, at least for now.
Maybe we will add more capabilities in virtio 1.0, or
after that.

> Sure that doesn't help much!
> Rusty.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-20  6:39       ` Markus Armbruster
@ 2014-03-20 12:53         ` Peter Maydell
  2014-03-26 14:34           ` Markus Armbruster
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Maydell @ 2014-03-20 12:53 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Rusty Russell, Dave Airlie, qemu-devel@nongnu.org

On 20 March 2014 06:39, Markus Armbruster <armbru@redhat.com> wrote:
> Time to cease the practice.  Will be hard as long as the code
> is chock-full of bad examples.

I've been consistently rejecting new instances of
guest triggerable exit() or abort() in code review when
I see them for at least the last six months to a year or so
(since we improved the qemu_log() facility so it could be
used by device models too).

thanks
-- PMM

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-20  6:51   ` Michael S. Tsirkin
@ 2014-03-21  9:44     ` Yan Vugenfirer
  0 siblings, 0 replies; 21+ messages in thread
From: Yan Vugenfirer @ 2014-03-21  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Dave Airlie, qemu-devel@nongnu.org

[-- Attachment #1: Type: text/plain, Size: 3166 bytes --]


On Mar 20, 2014, at 8:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:

> On Wed, Mar 19, 2014 at 11:04:19AM +1030, Rusty Russell wrote:
>> Dave Airlie <airlied@gmail.com> writes:
>>> So I'm looking at how best to do virtio gpu device error reporting,
>>> and how to deal with illegal stuff,
>>> 
>>> I've two levels of errors I want to support,
>>> 
>>> a) unrecoverable or bad guest kernel programming errors,
>> 
>> The QEMU standard approach is to exit at this point.  No, really.
> 
> It's easy on the hypervisor but often not very friendly for driver writers
> who might not be qemu experts.
> QEMU's moving away from exiting on errors and it would be nice
> to have a robust way to report driver bugs.
> How about setting VIRTIO_CONFIG_S_DEVICE_FAILED ?
> 
> Another idea that windows driver implemented is reporting
> failure reason hint. They wrote it out to ISR, specifically
> they notified host about watchdog timer expiration for net device
> in this way.

I removed it for now and really would like to have an official way to bring it back.

Also going back to the original question - Windows can handle graphic cards HW errors by reloading the driver and reseting the device (stating from Vista).

> 
>>> b) per 3D context errors from the renderer backend,
>>> 
>>> (b) I can easily report in an event queue and the guest kernel can in
>>> theory blow away the offenders, this is how GL works with some
>>> extensions,
>> 
>> That's probably sanest.
> 
> If it's possible to identify the offenders, I agree
> a VQ is better than config space for that.
> Need to make sure the queue is big enough to avoid
> underrun of that queue though. Is that always possible?
> 
>>> GPU control queue, the response should always be no error, but in some
>>> cases it will be because the guest hit some host resource error, or
>>> asked for something insane, (guest kernel drivers would be broken in
>>> most of these cases).
>>> 
>>> Alternately I can use the separate event queue to send async errors
>>> when the guest does something bad,
>>> 
>>> I'm also considering adding some sort of flag in config space saying
>>> the device needs a reset before it will continue doing anything,
>> 
>> I generally dislike error codes which Never Happen; it's like making
>> every void function return int just in case: the caller has no idea what
>> to do if it fails.
>> 
>> The litmus test: does *your* guest handle failures other than by giving
>> up on the device?  If so, sure, you need to have a sane error-reporting
>> strategy.
> 
> Right but driver development is also a valid need.
> 
>>> The main reason I'm considering this stuff is for security reasons if
>>> the guest asks for something really illegal or crazy what should the
>>> expected behaviour of the host be? (at least secure I know that).
>> 
>> If the guest userspace can do it, don't exit.  If the kernel only, and
>> it's should have known better, abort is OK.
> 
> I second that, at least for now.
> Maybe we will add more capabilities in virtio 1.0, or
> after that.
> 
>> Sure that doesn't help much!
>> Rusty.


[-- Attachment #2: Type: text/html, Size: 4592 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-17 14:28 ` Laszlo Ersek
  2014-03-17 14:40   ` Peter Maydell
@ 2014-03-26 12:49   ` Stefan Hajnoczi
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2014-03-26 12:49 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Dave Airlie, qemu-devel@nongnu.org, Seiji Aguchi

On Mon, Mar 17, 2014 at 03:28:08PM +0100, Laszlo Ersek wrote:
> On 03/17/14 07:02, Dave Airlie wrote:
> > So I'm looking at how best to do virtio gpu device error reporting,
> > and how to deal with illegal stuff,
> > 
> > I've two levels of errors I want to support,
> > 
> > a) unrecoverable or bad guest kernel programming errors,
> > 
> > b) per 3D context errors from the renderer backend,
> > 
> > (b) I can easily report in an event queue and the guest kernel can in
> > theory blow away the offenders, this is how GL works with some
> > extensions,
> > 
> > For (a) I can expect a response from every command I put into the main
> > GPU control queue, the response should always be no error, but in some
> > cases it will be because the guest hit some host resource error, or
> > asked for something insane, (guest kernel drivers would be broken in
> > most of these cases).
> > 
> > Alternately I can use the separate event queue to send async errors
> > when the guest does something bad,
> > 
> > I'm also considering adding some sort of flag in config space saying
> > the device needs a reset before it will continue doing anything,
> > 
> > The main reason I'm considering this stuff is for security reasons if
> > the guest asks for something really illegal or crazy what should the
> > expected behaviour of the host be? (at least secure I know that).
> 
> exit(1).
> 
> If you grep qemu for it, you'll find such examples. Notably,
> "hw/virtio/virtio.c" is chock full of them; if the guest doesn't speak
> the basic protocol, there's nothing for the host to do. See also
> virtio-blk.c (missing or incorrect headers), virtio-net.c (similar
> protocol violations), virtio-scsi.c (wrong header size, bad config etc).
> 
> For later, we have a use case on the horizon where all such exits -- not
> just virtio, but exit(1) or abort() on invalid guest behavior in general
> -- should be optionally coupled (dependent on the qemu command line)
> with an automatic dump-guest-memory, in order to help debugging the guest.

Please don't use exit(1).  Instead you can put the device into a
"broken" state and wait for the guest to reset it.

exit(1) is nasty because a failure in one driver shouldn't bring down
the entire VM if we can prevent it.

Also, it's a denial-of-service if we ever allow virtio passthrough to
nested guests - a nested guest could kill its parent and hence all other
nested guests.

Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-20 12:53         ` Peter Maydell
@ 2014-03-26 14:34           ` Markus Armbruster
  2014-03-27  0:54             ` Venkatesh Srinivas
  0 siblings, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2014-03-26 14:34 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Rusty Russell, Dave Airlie, qemu-devel@nongnu.org

Peter Maydell <peter.maydell@linaro.org> writes:

> On 20 March 2014 06:39, Markus Armbruster <armbru@redhat.com> wrote:
>> Time to cease the practice.  Will be hard as long as the code
>> is chock-full of bad examples.
>
> I've been consistently rejecting new instances of
> guest triggerable exit() or abort() in code review when
> I see them for at least the last six months to a year or so
> (since we improved the qemu_log() facility so it could be
> used by device models too).

Appreciated.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] virtio device error reporting best practice?
  2014-03-26 14:34           ` Markus Armbruster
@ 2014-03-27  0:54             ` Venkatesh Srinivas
  0 siblings, 0 replies; 21+ messages in thread
From: Venkatesh Srinivas @ 2014-03-27  0:54 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Maydell, Rusty Russell, Dave Airlie, qemu-devel@nongnu.org

On Wed, Mar 26, 2014 at 7:34 AM, Markus Armbruster <armbru@redhat.com> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>
>> On 20 March 2014 06:39, Markus Armbruster <armbru@redhat.com> wrote:
>>> Time to cease the practice.  Will be hard as long as the code
>>> is chock-full of bad examples.
>>
>> I've been consistently rejecting new instances of
>> guest triggerable exit() or abort() in code review when
>> I see them for at least the last six months to a year or so
>> (since we improved the qemu_log() facility so it could be
>> used by device models too).
>
> Appreciated.

How are these errors expected to be reflected into guests? PCI aborts
could be one place to stand, for example.

-- vs;

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-03-27  0:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-17  6:02 [Qemu-devel] virtio device error reporting best practice? Dave Airlie
2014-03-17 14:28 ` Laszlo Ersek
2014-03-17 14:40   ` Peter Maydell
2014-03-17 14:49     ` Laszlo Ersek
2014-03-17 14:54       ` Peter Maydell
2014-03-17 14:57       ` Gerd Hoffmann
2014-03-17 19:05       ` Andreas Färber
2014-03-18 12:45       ` Kevin Wolf
2014-03-17 14:57     ` Richard W.M. Jones
2014-03-17 14:59       ` Richard W.M. Jones
2014-03-26 12:49   ` Stefan Hajnoczi
2014-03-17 14:50 ` Gerd Hoffmann
2014-03-19  0:34 ` Rusty Russell
2014-03-19  8:12   ` Markus Armbruster
2014-03-20  3:40     ` Rusty Russell
2014-03-20  6:39       ` Markus Armbruster
2014-03-20 12:53         ` Peter Maydell
2014-03-26 14:34           ` Markus Armbruster
2014-03-27  0:54             ` Venkatesh Srinivas
2014-03-20  6:51   ` Michael S. Tsirkin
2014-03-21  9:44     ` Yan Vugenfirer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).