From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [PATCH -v2] QEMU-KVM: MCE: Relay UCR MCE to guest
Date: Thu, 17 Sep 2009 18:36:56 -0300
Message-ID: <20090917213656.GC13907@amt.cnet>
References: <1252463282.5212.44.camel@yhuang-dev.sh.intel.com> <20090916175931.GA7997@amt.cnet> <1253150009.15717.462.camel@yhuang-dev.sh.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Avi Kivity <avi@redhat.com>, Andi Kleen <andi@firstfloor.org>,
	Anthony Liguori <aliguori@us.ibm.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
To: Huang Ying <ying.huang@intel.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:5080 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754421AbZIQVht (ORCPT <rfc822;kvm@vger.kernel.org>);
	Thu, 17 Sep 2009 17:37:49 -0400
Content-Disposition: inline
In-Reply-To: <1253150009.15717.462.camel@yhuang-dev.sh.intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Thu, Sep 17, 2009 at 09:13:29AM +0800, Huang Ying wrote:
> On Thu, 2009-09-17 at 01:59 +0800, Marcelo Tosatti wrote: 
> > On Wed, Sep 09, 2009 at 10:28:02AM +0800, Huang Ying wrote:
> > > UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
> > > where some hardware error such as some memory error can be reported
> > > without PCC (processor context corrupted). To recover from such MCE,
> > > the corresponding memory will be unmapped, and all processes accessing
> > > the memory will be killed via SIGBUS.
> > > 
> > > For KVM, if QEMU/KVM is killed, all guest processes will be killed
> > > too. So we relay SIGBUS from host OS to guest system via a UCR MCE
> > > injection. Then guest OS can isolate corresponding memory and kill
> > > necessary guest processes only. SIGBUS sent to main thread (not VCPU
> > > threads) will be broadcast to all VCPU threads as UCR MCE.
> > > 
> > > v2:
> > > 
> > > - Use qemu_ram_addr_from_host instead of a self-made one to convert from
> > >   host address to guest RAM address. Thanks Anthony Liguori.
> > > 
> > > Signed-off-by: Huang Ying <ying.huang@intel.com>
> > > 
> > > ---
> > >  cpu-common.h      |    1 
> > >  exec.c            |   20 +++++--
> > >  qemu-kvm.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  target-i386/cpu.h |   20 ++++++-
> > >  4 files changed, 178 insertions(+), 17 deletions(-)
> > > 
> > > --- a/qemu-kvm.c
> > > +++ b/qemu-kvm.c
> > > @@ -27,10 +27,23 @@
> > >  #include <sys/mman.h>
> > >  #include <sys/ioctl.h>
> > >  #include <signal.h>
> > > +#include <sys/signalfd.h>
> > > +#include <sys/prctl.h>
> > >  
> > >  #define false 0
> > >  #define true 1
> > >  
> > > +#ifndef PR_MCE_KILL
> > > +#define PR_MCE_KILL 33
> > > +#endif
> > > +
> > > +#ifndef BUS_MCEERR_AR
> > > +#define BUS_MCEERR_AR 4
> > > +#endif
> > > +#ifndef BUS_MCEERR_AO
> > > +#define BUS_MCEERR_AO 5
> > > +#endif
> > > +
> > >  #define EXPECTED_KVM_API_VERSION 12
> > >  
> > >  #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
> > > @@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n)
> > >  {
> > >  }
> > >  
> > > +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
> > > +{
> > > +    if (siginfo->ssi_code == BUS_MCEERR_AO) {
> > > +        uint64_t status;
> > > +        unsigned long paddr;
> > > +        CPUState *cenv;
> > > +
> > > +        /* Hope we are lucky for AO MCE */
> > > +        if (do_qemu_ram_addr_from_host((void *)siginfo->ssi_addr, &paddr)) {
> > > +            fprintf(stderr, "Hardware memory error for memory used by "
> > > +                    "QEMU itself instead of guest system!: %llx\n",
> > > +                    (unsigned long long)siginfo->ssi_addr);
> > > +            return;
> > 
> > qemu-kvm should die here?
> 
> There are two kinds of UCR MCE. One is triggered by user space/guest
> read/write, the other is triggered by asynchronously detected error
> (e.g. patrol scrubbing). The latter one is reported as AO (Action
> Optional) MCE, and it has nothing to do with current path. So if we are
> lucky enough, we can survive. And when we finally touch the error memory
> reported by AO MCE, another AR (Action Required) MCE will be triggered.
> We have another chance to deal with it.

OK.

> 
> > > +        }
> > > +        status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
> > > +            | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
> > > +            | 0xc0;
> > > +        kvm_inject_x86_mce(first_cpu, 9, status,
> > > +                           MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
> > > +                           (MCM_ADDR_PHYS << 6) | 0xc);
> > > +        for (cenv = first_cpu->next_cpu; cenv != NULL; cenv = cenv->next_cpu)
> > > +            kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
> > > +                               MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0);
> > > +        return;
> > 
> > Should abort if kvm_inject_x86_mce fails?
> 
> kvm_inject_x86_mce will abort by itself.

OK.

> 
> > > +    } else if (siginfo->ssi_code == BUS_MCEERR_AR)
> > > +        fprintf(stderr, "Hardware memory error!\n");
> > > +    else
> > > +        fprintf(stderr, "Internal error in QEMU!\n");
> > 
> > Can you re-raise SIGBUS so we get a coredump on non-MCE SIGBUS as
> > usual?
> 
> We discussed this before. The discussion is copied below; please
> comment on it. :)
> 
> Avi:
> (also, if we can't handle guest-mode SIGBUS I think it would be nice
> to raise it again so the process terminates due to the SIGBUS).
> 
> Huang Ying:
> For a SIGBUS that we cannot relay to the guest as MCE, we can either
> abort or reset SIGBUS to SIG_DFL and re-raise it. Both are OK for me.
> Do you prefer the latter?
> 
> Andi:
> I think a suitable error message and exit would be better than a plain
> signal kill. It shouldn't look like qemu crashed due to a software
> bug. Ideally an error message in a form that can be parsed by libvirt
> etc. and reported in a suitable way.
>
> However, qemu getting killed itself is very unlikely; it doesn't
> have much memory footprint compared to the guest and other data.
> So this should be a very rare condition.
> 
> Avi:
> libvirt etc. can/should wait() for qemu to terminate abnormally and 
> report the reason why.  However it doesn't seem there is a way to get 
> extended signal information from wait(), so it looks like internal 
> handling by qemu is better.

I'm not talking about SIGBUS generated by MCE.

What I mean is that for SIGBUS signals that are not caused by an MCE,
the current behaviour is to generate a core dump (which is useful
information for debugging).

With your patch, qemu-kvm handles the signal and prints a message before
exiting, so the core dump is lost.

That is annoying. The discussion above seems to be only about SIGBUS
initiated by MCE.