From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50823)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <armbru@redhat.com>) id 1WQBbt-0005eV-OS
	for qemu-devel@nongnu.org; Wed, 19 Mar 2014 04:12:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <armbru@redhat.com>) id 1WQBbo-0004le-Hd
	for qemu-devel@nongnu.org; Wed, 19 Mar 2014 04:12:25 -0400
Received: from mx1.redhat.com ([209.132.183.28]:6157)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <armbru@redhat.com>) id 1WQBbo-0004lV-9F
	for qemu-devel@nongnu.org; Wed, 19 Mar 2014 04:12:20 -0400
From: Markus Armbruster <armbru@redhat.com>
References: <CAPM=9twJX3F+as1TuoerW1Yt-b0xw8YEf1YHa0B+MLMJBd0i_w@mail.gmail.com>
	<87lhw7rppw.fsf@rustcorp.com.au>
Date: Wed, 19 Mar 2014 09:12:15 +0100
In-Reply-To: <87lhw7rppw.fsf@rustcorp.com.au> (Rusty Russell's message of
	"Wed, 19 Mar 2014 11:04:19 +1030")
Message-ID: <87vbva6200.fsf@blackfin.pond.sub.org>
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [Qemu-devel] virtio device error reporting best practice?
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Airlie <airlied@gmail.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>

Rusty Russell <rusty@rustcorp.com.au> writes:

> Dave Airlie <airlied@gmail.com> writes:
>> So I'm looking at how best to do virtio gpu device error reporting,
>> and how to deal with illegal stuff,
>>
>> I've two levels of errors I want to support,
>>
>> a) unrecoverable or bad guest kernel programming errors,
>
> The QEMU standard approach is to exit at this point.  No, really.
>
>> b) per 3D context errors from the renderer backend,
>>
>> (b) I can easily report in an event queue and the guest kernel can in
>> theory blow away the offenders, this is how GL works with some
>> extensions,
>
> That's probably sanest.
>
>> For (a) I can expect a response from every command I put into the main
>> GPU control queue, the response should always be no error, but in some
>> cases it will be because the guest hit some host resource error, or
>> asked for something insane, (guest kernel drivers would be broken in
>> most of these cases).
>>
>> Alternately I can use the separate event queue to send async errors
>> when the guest does something bad,
>>
>> I'm also considering adding some sort of flag in config space saying
>> the device needs a reset before it will continue doing anything,
>
> I generally dislike error codes which Never Happen; it's like making
> every void function return int just in case: the caller has no idea what
> to do if it fails.
>
> The litmus test: does *your* guest handle failures other than by giving
> up on the device?  If so, sure, you need to have a sane error-reporting
> strategy.

Err, isn't this a circular argument?  No need for QEMU to report the
failure, because the guest won't handle it; no need to handle the
failure, because QEMU won't report it.

What about this: would you make your guest handle failures if they were
reported?

>> The main reason I'm considering this stuff is for security reasons if
>> the guest asks for something really illegal or crazy what should the
>> expected behaviour of the host be? (at least secure I know that).
>
> If the guest userspace can do it, don't exit.  If the kernel only, and
> it should have known better, abort is OK.
>
> Sure that doesn't help much!

Immediate exit() or abort() denies the guest the ability to degrade
service gracefully (disable the device, cry for help and try to hobble
on), or report its brokenness ungracefully (kernel panic, crash dump).
I doubt denying that is okay unless the device is so important that
without it you can't even hope to panic.
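
To make the alternative concrete, here is a minimal sketch of a device
that refuses further service after a guest programming error instead of
taking the whole VM down.  Everything in it is an assumption made up for
illustration: the demo_* names, the response codes, and the needs_reset
flag are hypothetical and are not QEMU or virtio API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical response codes; not taken from any virtio specification. */
enum demo_resp {
    DEMO_RESP_OK = 0,
    DEMO_RESP_ERR_DEVICE_BROKEN = 1,
};

/* Hypothetical device state; needs_reset stands in for a status flag
 * that the guest could read from config space. */
struct demo_dev {
    bool needs_reset;
    uint32_t max_resource_id;
};

/* Hypothetical command as the guest would place it on a control queue. */
struct demo_cmd {
    uint32_t resource_id;
};

/* Handle one command.  On a guest error, mark the device broken and
 * report that to the guest; do not call exit() or abort(). */
static enum demo_resp demo_handle_cmd(struct demo_dev *dev,
                                      const struct demo_cmd *cmd)
{
    if (dev->needs_reset) {
        /* Device already broken: refuse all work until it is reset. */
        return DEMO_RESP_ERR_DEVICE_BROKEN;
    }
    if (cmd->resource_id > dev->max_resource_id) {
        fprintf(stderr, "demo: guest error: bad resource id %u\n",
                (unsigned)cmd->resource_id);
        dev->needs_reset = true;
        return DEMO_RESP_ERR_DEVICE_BROKEN;
    }
    return DEMO_RESP_OK;
}

/* Guest-initiated reset clears the broken state. */
static void demo_reset(struct demo_dev *dev)
{
    dev->needs_reset = false;
}
```

With something like this, the guest decides what to do once it sees the
error: disable the device and hobble on, reset it and retry, or panic
and take a crash dump.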