From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:50772) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a6J7L-0000o6-RE for qemu-devel@nongnu.org; Tue, 08 Dec 2015 09:19:48 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a6J7I-0001G7-Go for qemu-devel@nongnu.org; Tue, 08 Dec 2015 09:19:47 -0500 Received: from mx1.redhat.com ([209.132.183.28]:57943) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a6J7I-0001Fo-Ai for qemu-devel@nongnu.org; Tue, 08 Dec 2015 09:19:44 -0500 Date: Tue, 8 Dec 2015 14:19:39 +0000 From: "Dr. David Alan Gilbert" Message-ID: <20151208141938.GB2593@work-vm> References: <87a8pl9hmt.fsf@blackfin.pond.sub.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87a8pl9hmt.fsf@blackfin.pond.sub.org> Subject: Re: [Qemu-devel] Error handling in realize() methods List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Markus Armbruster Cc: Peter Maydell , Paolo Bonzini , Peter Crosthwaite , qemu-devel@nongnu.org, Andreas =?iso-8859-1?Q?F=E4rber?= * Markus Armbruster (armbru@redhat.com) wrote: > In general, code running withing a realize() method should not exit() on > error. Instad, errors should be propagated through the realize() > method. Additionally, the realize() method should fail cleanly, > i.e. carefully undo its side effects such as wiring of interrupts, > mapping of memory, and so forth. Tedious work, but necessary to make > hot plug safe. > > Quite a few devices don't do that. > > Some of them can be usefully hot-plugged, and for them unclean failures > are simply bugs. I'm going to mark the ones I can find. > > Others are used only as onboard devices, and if their realize() fails, > the machine's init() will exit()[*]. In an ideal world, we'd start with > an empty board and cold-plugg devices, and there, clean failure may be > useful. In the world we live in, making these devices fail cleanly is a > lot of tedious work for no immediate gain. > > Example: machine "kzm" and device "fsl,imx31". fsl_imx31_realize() > returns without cleanup on error. kzm_init() exit(1)s when realize > fails, so the lack of cleanup is a non-issue. > > I think this is basically okay for now, but I'd like us to mark these > devices cannot_instantiate_with_device_add_yet, with /* Reason: > realize() method fails uncleanly */. > > Opinions? > > Next, let's consider the special case "out of memory". > > Our general approach is to treat it as immediately fatal. This makes > sense, because when a smallish allocation fails, the process is almost > certainly doomed anyway. Moreover, memory allocation is frequent, and > attempting to recover from failed memory allocation adds loads of > hard-to-test error paths. These are *dangerous* unless carefully tested > (and we don't). > > Certain important allocations we handle more gracefully. For instance, > we don't want to die when the user tries to hot-plug more memory than we > can allocate, or tries to open a QCOW2 image with a huge L1 table. > > Guest memory allocation used to have the "immediately fatal" policy > baked in at a fairly low level, but it's since been lifted into callers; > see commit c261d77..fc7a580 and fixups 4f96676..0bdaa3a. During review > of the latter, Peter Crosthwaite called out the &error_fatal in the > realize methods and their supporting code. I agreed with him back then > that the errors should really be propagated. But now I've changed my > mind: I think we should treat these memory allocation failures like > we've always treated them, namely report and exit(1). Except for > "large" allocations, where we have a higher probability of failure, and > a more realistic chance to recover safely. > > Can we agree that passing &error_fatal to memory_region_init_ram() & > friends is basically okay even in realize() methods and their supporting > code? I'd say it depends if they can be hotplugged; I think anything that we really want to hotplug onto real running machines (as opposed to where we're just hotplugging during setup) we should propagate errors properly. And tbh I don't buy the small allocation argument; I think we should handle them all; in my utopian world a guest wouldn't die unless there was no way out. Dave > > [*] Well, the ones that bother to check for errors, but that's a > separate problem. > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK