From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
"Peter Crosthwaite" <crosthwaitepeter@gmail.com>,
qemu-devel@nongnu.org, "Andreas Färber" <afaerber@suse.de>,
"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] Error handling in realize() methods
Date: Wed, 9 Dec 2015 10:29:32 +0000 [thread overview]
Message-ID: <20151209102931.GC2712@work-vm> (raw)
In-Reply-To: <87io480y0n.fsf@blackfin.pond.sub.org>
* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> In general, code running withing a realize() method should not exit() on
> >> error. Instad, errors should be propagated through the realize()
> >> method. Additionally, the realize() method should fail cleanly,
> >> i.e. carefully undo its side effects such as wiring of interrupts,
> >> mapping of memory, and so forth. Tedious work, but necessary to make
> >> hot plug safe.
> [...]
> >> Next, let's consider the special case "out of memory".
> >>
> >> Our general approach is to treat it as immediately fatal. This makes
> >> sense, because when a smallish allocation fails, the process is almost
> >> certainly doomed anyway. Moreover, memory allocation is frequent, and
> >> attempting to recover from failed memory allocation adds loads of
> >> hard-to-test error paths. These are *dangerous* unless carefully tested
> >> (and we don't).
> >>
> >> Certain important allocations we handle more gracefully. For instance,
> >> we don't want to die when the user tries to hot-plug more memory than we
> >> can allocate, or tries to open a QCOW2 image with a huge L1 table.
> >>
> >> Guest memory allocation used to have the "immediately fatal" policy
> >> baked in at a fairly low level, but it's since been lifted into callers;
> >> see commit c261d77..fc7a580 and fixups 4f96676..0bdaa3a. During review
> >> of the latter, Peter Crosthwaite called out the &error_fatal in the
> >> realize methods and their supporting code. I agreed with him back then
> >> that the errors should really be propagated. But now I've changed my
> >> mind: I think we should treat these memory allocation failures like
> >> we've always treated them, namely report and exit(1). Except for
> >> "large" allocations, where we have a higher probability of failure, and
> >> a more realistic chance to recover safely.
> >>
> >> Can we agree that passing &error_fatal to memory_region_init_ram() &
> >> friends is basically okay even in realize() methods and their supporting
> >> code?
> >
> > I'd say it depends if they can be hotplugged; I think anything that we really
> > want to hotplug onto real running machines (as opposed to where we're just
> > hotplugging during setup) we should propagate errors properly.
> >
> > And tbh I don't buy the small allocation argument; I think we should handle them
> > all; in my utopian world a guest wouldn't die unless there was no way out.
>
> I guess in Utopia nobody ever makes stupid coding mistakes, the error
> paths are all covered by a comprehensive test suite, and they make the
> code prettier, too. Oh, and kids always eat their vegetables without
> complaint.
Yes, it's lovely.
> However, we don't actually live in Utopia. In our world, error paths
> clutter the code, are full of bugs, and the histogram of their execution
> counts in testing (automated or not) has a frighteningly tall bar at
> zero.
>
> We're not going to make this problem less severe by making it bigger.
> In fact, we consciously decided to hack off a big chunk with an axe:
>
> commit 8a1d02aba9f986ca03d854184cd432ee98bcd179
> Author: aliguori <aliguori@c046a42c-6fe2-441c-8c8c-71466251a162>
> Date: Thu Feb 5 22:05:49 2009 +0000
>
> Terminate emulation on memory allocation failure (Avi Kivity)
>
> Memory allocation failures are a very rare condition on virtual-memory
> hosts. They are also very difficult to handle correctly (especially in a
> hardware emulation context). Because of this, it is better to gracefully
> terminate emulation rather than executing untested or even unwritten recovery
> code paths.
>
> This patch changes the qemu memory allocation routines to terminate emulation
> if an allocation failure is encountered.
>
> Signed-off-by: Avi Kivity <avi@redhat.com>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
>
> git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6526 c046a42c-6fe2-441c-8c8c-71466251a162
>
> Let me elaborate a bit on Avi's arguments:
>
> * Memory allocations are very, very common. I count at least 2500,
> Memory allocation failure is easily the most common *potential* error,
> both statically and dynamically.
>
> * Error recovery is always tedious and often hard. Especially when the
> error happens in the middle of a complex operation that needs to be
> fully rolled back to recover. A sensible approach is to acquire
> resources early, when recovery is still relatively easy, but that's
> often impractical for memory. This together with the ubiquitousness
> of memory allocation makes memory allocation failure even harder to
> handle than other kinds of errors.
>
> * Not all allocations are equal. When an attempt to allocate a gigabyte
> fails gracefully, there's a good chance that ordinary code can go on
> allocating and freeing kilobytes as usual. But when an attempt to
> allocate kilobytes fails, it's very likely that handling this failure
> gracefully will only lead to another one, and another one, until some
> buggy error handling puts the doomed process out of its misery.
>
> Out of the ~2400 memory allocations that go through GLib, 59 can fail.
> The others all terminate the process.
>
> * How often do we see failures from these other 2300+? Bug reports from
> users? As far as I can see, they're vanishingly rare.
>
> * Reviewing and testing the error handling chains rooted at 59
> allocations is hard enough, and I don't think we're doing particularly
> well on the testing. What chance would we have with 2300+ more?
>
> 2300+ instances of a vanishingly rare error with generally complex
> error handling and basically no test coverage: does that sound more
> useful than 2300+ instances of a vanishingly rare error that kills the
> process? If yes, how much of our limited resources is the difference
> worth?
>
> * You might argue that we don't have to handle all 2300+ instances, only
> the ones reachable from hotplug. I think that's a pipe dream. Once
> you permit "terminate on memory allocation failure" in general purpose
> code, it hides behind innocent-looking function calls. Even if we
> could find them all, we'd still have to add memory allocation failure
> handling to lots of general purpose code just to make it usable for
> hot plug. And then we'd get to keep finding them all forever.
I didn't say it was easy :-)
> I don't think handling all memory allocation failures gracefully
> everywhere or even just within hotplug is practical this side of Utopia.
> I believe all we could achieve trying is an illusion of graceful
> handling that is sufficiently convincing to let us pat on our backs,
> call the job done, and go back to working on stuff that matters to
> actual users.
Handling them all probably isn't; handling some well defined cases is
probably possible.
Avi's argument is 6 years old, I suggest a few things have changed in that
time:
a) We now use the Error** mechanism in a lot of places - so a lot of
code already is supposed to deal with a function call failing; if a function
already has an error return and the caller deals with it, then making
the function deal with an allocation error and the caller handle it is
a lot easier.
b) The use of hotplug is now common - people really hate it when their
nice, happy working VM dies when they try and do something to it, like
hotplug or migrate.
c) I suspect (but don't know) that people are pushing the number of VMs on
a host harder than they used to, but there again memory got cheap.
I'm not that eager to protect every allocation; but in some common
cases, where we already have Error** paths and it's relatively simple,
then I think we should.
(OK, to be honest I think we should protect every allocation - but I do
have sympathy with the complexity/testing arguments).
Dave
> My current working assumption is that passing &error_fatal to
> memory_region_init_ram() & friends is okay even in realize() methods and
> their supporting code, except when the allocation can be large. Even
> then, &error_fatal is better than buggy recovery code (which I can see
> all over the place, but that's a separate topic).
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next prev parent reply other threads:[~2015-12-09 10:29 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-08 13:47 [Qemu-devel] Error handling in realize() methods Markus Armbruster
2015-12-08 14:19 ` Dr. David Alan Gilbert
2015-12-09 9:30 ` Markus Armbruster
2015-12-09 10:29 ` Dr. David Alan Gilbert [this message]
2015-12-09 11:10 ` Laszlo Ersek
2015-12-10 9:22 ` Markus Armbruster
2015-12-10 11:10 ` Laszlo Ersek
2015-12-09 11:47 ` Peter Maydell
2015-12-09 12:25 ` Laszlo Ersek
2015-12-09 13:21 ` Dr. David Alan Gilbert
2015-12-10 9:27 ` Markus Armbruster
2015-12-09 13:09 ` Paolo Bonzini
2015-12-09 13:12 ` Dr. David Alan Gilbert
2015-12-09 13:43 ` Paolo Bonzini
2015-12-10 11:06 ` Markus Armbruster
2015-12-10 11:21 ` Dr. David Alan Gilbert
2015-12-10 11:22 ` Paolo Bonzini
2015-12-10 11:26 ` Paolo Bonzini
2015-12-10 12:25 ` Markus Armbruster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151209102931.GC2712@work-vm \
--to=dgilbert@redhat.com \
--cc=afaerber@suse.de \
--cc=armbru@redhat.com \
--cc=crosthwaitepeter@gmail.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).