From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:45588) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cr6Ey-0005Lf-Ce for qemu-devel@nongnu.org; Thu, 23 Mar 2017 13:09:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cr6Ev-0000ji-8R for qemu-devel@nongnu.org; Thu, 23 Mar 2017 13:09:36 -0400
Received: from mx1.redhat.com ([209.132.183.28]:41922) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cr6Eu-0000gQ-V8 for qemu-devel@nongnu.org; Thu, 23 Mar 2017 13:09:33 -0400
Date: Thu, 23 Mar 2017 19:09:31 +0200
From: "Michael S. Tsirkin"
Message-ID: <20170323190609-mutt-send-email-mst@kernel.org>
References: <20170302185942.76255-1-pasic@linux.vnet.ibm.com> <20170303132149.34e5906a.cornelia.huck@de.ibm.com> <20170303135012.17edb640.cornelia.huck@de.ibm.com> <20170306155625.5c426793.cornelia.huck@de.ibm.com> <20170306170441.2a8f8b27.cornelia.huck@de.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20170306170441.2a8f8b27.cornelia.huck@de.ibm.com>
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH 1/1] virtio: fail device if set_event_notifier fails
To: Cornelia Huck
Cc: Halil Pasic, qemu-devel@nongnu.org

On Mon, Mar 06, 2017 at 05:04:41PM +0100, Cornelia Huck wrote:
> On Mon, 6 Mar 2017 16:21:13 +0100
> Halil Pasic wrote:
>
> > On 03/06/2017 03:56 PM, Cornelia Huck wrote:
> > > On Fri, 3 Mar 2017 14:08:37 +0100
> > > Halil Pasic wrote:
> > >
> > >> On 03/03/2017 01:50 PM, Cornelia Huck wrote:
> > >>> On Fri, 3 Mar 2017 13:43:32 +0100
> > >>> Halil Pasic wrote:
> > >>>
> > >>>> On 03/03/2017 01:21 PM, Cornelia Huck wrote:
> > >>>>> On Thu, 2 Mar 2017 19:59:42 +0100
> > >>>>> Halil Pasic wrote:
> > [...]
> > >> I admit, I did not investigate this thoroughly, also because the patch
> > >> is flawed regarding multi-thread anyway. After a quick investigation
> > >> it seems the linux guest won't auto-reset the device so the guest should
> > >> end up with a not working device. I think it's pretty likely that the
> > >> admin will check the logs if the device was important.
> > >
> > > Thinking a bit more about this, it seems setting the device broken is
> > > not the right solution for exactly that reason. Setting the virtio
> > > device broken is a way to signal the guest to 'you did something
> > > broken; please reset the device and start anew' (and that's how current
> > > callers use it). In our case, this is not the guest's fault.
> >
> > Do we have something to just say stuff broken without blaming the guest?
> > And device reset might not be that stupid at all in the given situation,
> > if we want to save what can be saved from the perspective of the guest.
> > (After reset stuff should work again until we hit the race again -- and
> > since turning ioeventfd on/off should not happen that often during normal
> > operation it could help limit damage suffered -- e.g. controlled shutdown).
>
> Checking again, the spec says
>
> DEVICE_NEEDS_RESET (64) Indicates that the device has experienced an
> error from which it can’t recover.
>
> Nothing about 'guest error'.
>
> The only problem is that legacy devices don't have that state, which
> means they'll have a broken device through no fault of their own.
>
> >
> > >
> > > Maybe go back to the assert 'solution'? But I'm not sure that's enough
> > > if production builds disable asserts...
> > >
> >
> > I will wait a bit, maybe other virtio folks are going to have an
> > opinion too.
> >
> > My concern about the assert solution is that for production it is
> > either too rigorous (kill off, hopefully with a dump) or not
> > enough (as you have mentioned, if NDEBUG assert does nothing).
> >
> >
> > I think there are setups where a loss of device does not have to be
> > fatal, and I would not like to be the one who makes it fatal (for the
> > guest).
>
> Basically, it's a host bug (and not a bug specific to a certain
> device). Moving the device which was impacted to a broken state may be
> a useful mitigation.
>
> But yes, let's hear some other opinions.

We don't support NDEBUG really so I think an assert is fine for now.

Handling unexpected errors more gracefully is laudable but I think we
want a more systematic approach than just open-coding it in this
specific place.

-- 
MST
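
For illustration only, the trade-off discussed in this thread (an assert
versus an explicit check that marks the device broken) can be sketched in
C roughly as below. This is not QEMU code: struct demo_device and every
demo_* function are hypothetical names made up for this sketch, and the
"broken" flag merely stands in for whatever state a real device model
would set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a device; NOT QEMU's VirtIODevice. */
struct demo_device {
    bool broken;
};

/* Hypothetical operation that may fail: returns 0 on success, -1 on error. */
static int demo_set_notifier(bool fail)
{
    return fail ? -1 : 0;
}

/*
 * Variant 1: assert only.
 * In a normal build a failure aborts the whole process.
 * If the file is compiled with -DNDEBUG, assert() expands to nothing,
 * so the failure would be ignored silently.
 */
static void demo_assert_only(bool fail)
{
    int r = demo_set_notifier(fail);
    assert(r == 0);
    (void)r; /* avoid an unused-variable warning when NDEBUG is defined */
}

/*
 * Variant 2: explicit check.
 * The test is ordinary code, so it is unaffected by NDEBUG; only the
 * affected device is flagged and the process keeps running.
 */
static void demo_mark_broken(struct demo_device *dev, bool fail)
{
    int r = demo_set_notifier(fail);
    if (r < 0) {
        fprintf(stderr, "demo: setting the notifier failed (%d)\n", r);
        dev->broken = true;
    }
}
```

Variant 1 is the smaller change and is adequate as long as builds never
define NDEBUG. Variant 2 limits the damage to one device, but repeating
that pattern by hand at each call site is the kind of open-coding the
reply above argues against.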