Re: Logging, abnormal cases, ... - Dr. David Alan Gilbert

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: qemu-devel <qemu-devel@nongnu.org>
Subject: Re: Logging, abnormal cases, ...
Date: Thu, 17 Dec 2020 20:34:58 +0000	[thread overview]
Message-ID: <20201217203458.GQ4117@work-vm> (raw)
In-Reply-To: <5517eec6-1b0f-7c36-c1a7-61bd24bfb6cd@msgid.tls.msk.ru>

* Michael Tokarev (mjt@tls.msk.ru) wrote:
> Hi!
> 
> I want to bring a topic which is mostly neglected in qemu but it is
> one of very important, especially for serious usage of qemu.
> 
> This is about logging of various events or even some unexpected events
> coming from guest.
> 
> Let's see for example a very easy to understand situation: a virtual
> disk image of a VM runs out of space _on the host_, so that qemu does
> not have space to write newly allocated blocks to.  What does it do?
> By default it pauses the VM and just sits there waiting. But does it
> lets someone to know that it run onto something which it can't handle?
> Does it write at least some information about it somewhere?
> 
> I know this very situation should not happen in any serious environment
> to start with, but everything happens, and the should ring a very loud
> bell somewhere so that people can react sooner than later after numerous
> attempt to _understand_ what's going on.
> 
> What I'm talking about is that we should have some standard way of
> logging such events, at least, or a way of notifying something about these.
> Including for example start/stop, suspend/resume, migration out/migration
> in etc - while these might look like a job of some management layer on
> top of qemu (like libvirt), it's still better do that in qemu so that
> all possible tools using it will benefit.
> 
> But it is not restricted to such global events like outlined already,
> there are much more interesting cases there. For another example,
> there's a lot of places where qemu checks validity of various requests
> coming to virtual hardware it emulates, - so that DMA is done to a
> valid address, register holds valid index, etc etc. A majority of
> these checks results in a simple error return or even abort().
> But it'd be very interesting to be able to log these events, maybe
> with some context, and be able to turn that logging on/off at runtime.
> 
> The thing is - there ara at least 2 types of guests out there, one
> is possible "hostile" - this is what all the recent CVEs are about,
> it is possible to abuse qemu in one way or another by doing some
> strange things within a guest. And another is entirely different -
> a "friendly" guest which is here just for easy management, so that
> various migrations, expansions and reconfigurations are possible.
> And there, it should not do evil things, and if it does, we'd better
> know about that, as it is a possible bug which might turn into a
> smoking gun one day.
> 
> The other day we had a quite bad system crash in a VMWare environment
> which resulted in a massive data loss, interruption of service and
> lots of really hard work for everyone.  A crash which actually started
> after vmware side discovered that it does not have any place where to
> put a new write for a (production, "friendly") guest. After that issue
> were fixed, there were several more conditions _like_ this happened, -
> not with lack of free space, but with the same freezing, vmware stopped
> the guest for a few seconds several times, and I've seen brief change
> in the guest's icon color (turning to red) in the monitor at least twice,
> during these freezes. But there's nothing in the logs - *why* it stopped
> the guest, what happened?? - nothing's left.  And finally it crashed
> hard after live-migrating the VM into another host with larger capacity
> storage - at the moment the migration finished, the VM freeze for a few
> seconds, there was tons of kernel messages about stuck processes and
> kernel threads after that, and, among other things, there were a few
> error messages about failed writes into the (virtual) storage. Which,
> in turn, hit some bugs in oracle and finally the db become corrupt and
> oracle instance refused to open it in any way, crashing with sigsegvs
> at startup, so we had to resort to the harder way.

I'm glad it's not just me who has hard migration problems!

> The thing was that there was absolutely nothing on the vmware side,
> even the fact of migration weren't obvious in its logs. I bet there
> were some inconsistences within data request handling from the guest
> at least, or if system clocks were different this could have been
> logged too, or anything else not-so-expected. But there's nothing,
> so we don't sill know what happened. Sure it is a bug somewhere,
> maybe several bugs at once (vmware tools are running in the vm and
> the hypervisor surely can assume a safe state of the vm while doing
> last steps of the migration, oracle should not sigsegv on bad blocks,
> and so on). But if there were logs, we had at least some chance to
> debug and maybe fix it, or to work around this if fixing isn't possible.
> But there's nothing.
> 
> I think we can do much better in qemu.
> 
> Comments?

We've got lots of things, the problem is deciding what to use when.
We can:
  a) Send QMP events - that's pretty good for a 'hey I've stopped, disk
full'
  b) We've got a logging mechanism, but I think it's rarely used.
  c) We've got trace events - that's good for fine debug but not
what you want all the time.

For migration, if something fails, I always try and print to stderr
so I get it in the /var/log/libvirt/qemu/... log;  the problem is
that only happens if you know you've screwed up - if you corrupt some
state during the migrate, then a hung guest is often the outcome.

Dave

> Thanks,
> 
> /mjt
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

     prev parent reply	other threads:[~2020-12-17 20:38 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-17 18:21 Logging, abnormal cases, Michael Tokarev
2020-12-17 20:34 ` Dr. David Alan Gilbert [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201217203458.GQ4117@work-vm \
    --to=dgilbert@redhat.com \
    --cc=mjt@tls.msk.ru \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.