From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=50281 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1P3sPU-0005ir-4f for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:29:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1P3sJx-0004gD-FB for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:23:51 -0400
Received: from mx1.redhat.com ([209.132.183.28]:33338) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1P3sJx-0004g5-4d for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:23:49 -0400
Message-ID: <4CADE5FF.8060605@redhat.com>
Date: Thu, 07 Oct 2010 10:23:43 -0500
From: Dean Nelson
MIME-Version: 1.0
References: <20101004185447.891324545@redhat.com> <20101004185715.167557459@redhat.com> <4CABD7CC.6030909@jp.fujitsu.com> <20101006160531.GB4277@amt.cnet> <4CACBB94.10200@redhat.com> <4CAD417B.7060808@jp.fujitsu.com>
In-Reply-To: <4CAD417B.7060808@jp.fujitsu.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest
List-Id: qemu-devel.nongnu.org
To: Hidetoshi Seto
Cc: Marcelo Tosatti, qemu-devel@nongnu.org, kvm@vger.kernel.org, Huang Ying

On 10/06/2010 10:41 PM, Hidetoshi Seto wrote:
> (2010/10/07 3:10), Dean Nelson wrote:
>> On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
>>> On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
>>>> I got some more question:
>>>>
>>>> (2010/10/05 3:54), Marcelo Tosatti wrote:
>>>>> Index: qemu/target-i386/cpu.h
>>>>> ===================================================================
>>>>> --- qemu.orig/target-i386/cpu.h
>>>>> +++ qemu/target-i386/cpu.h
>>>>> @@ -250,16 +250,32 @@
>>>>> #define PG_ERROR_RSVD_MASK 0x08
>>>>> #define PG_ERROR_I_D_MASK 0x10
>>>>>
>>>>> -#define MCG_CTL_P (1UL<<8) /* MCG_CAP register available */
>>>>> +#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */
>>>>> +#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */
>>>>>
>>>>> -#define MCE_CAP_DEF MCG_CTL_P
>>>>> +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
>>>>> #define MCE_BANKS_DEF 10
>>>>>
>>>>
>>>> It seems that current kvm doesn't support SER_P, so injecting SRAO
>>>> to guest will mean that guest receives VAL|UC|!PCC and RIPV event
>>>> from virtual processor that doesn't have SER_P.
>>>
>>> Dean also noted this. I don't think it was deliberate choice to not
>>> expose SER_P. Huang?
>>
>> In my testing, I found that MCG_SER_P was not being set (and I was
>> running on a Nehalem-EX system). Injecting a MCE resulted in the
>> guest entering into panic() from mce_panic(). If crash_kexec()
>> finds a kexec_crash_image the system ends up rebooting, otherwise,
>> what happens next requires operator intervention.
>
> Good to know.
> What I'm concerning is that if memory scrubbing SRAO event is
> injected when !SER_P, linux guest with certain mce tolerant level
> might grade it as "UC" severity and continue running with none of
> panicking, killing and poisoning because of !PCC and RIPV.
>
> Could you provide the panic message of the guest in your test?
> I think it can tell me why the mce handler decided to go panic.

Sure, I'll add the info below at the end of this email.

>> When I applied a patch to the guest's kernel which forces mce_ser to be
>> set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
>> that when the memory page was 'owned' by a guest process, the process
>> would be killed (if the page was dirty), and the guest would stay
>> running. The HWPoisoned page would be sidelined and not cause any more
>> issues.
>
> Excellent.
> So while guest kernel knows which page is poisoned, guest processes
> are controlled not to touch the page.
>
> ... Therefore rebooting the vm and renewing kernel will lost the
> information where is poisoned.

Correct.

>>>> I think most OSes don't expect that it can receives MCE with !PCC
>>>> on traditional x86 processor without SER_P.
>>>>
>>>> Q1: Is it safe to expect that guests can handle such !PCC event?
>>
>> This might be best answered by Huang, but as I mentioned above, without
>> MCG_SER_P being set, the result was an orderly system panic on the
>> guest.
>
> Though I'll wait Huang (I think he is on holiday), I believe that
> system panic is just a possible option for AO (Action Optional)
> event, no matter how the SER_P is.

I think you may be correct, but Huang will know for sure.

>>>> Q2: What is the expected behavior on the guest?
>>
>> I think I answered this above.
>
> Yeah, thanks.
>
>>
>>>> Q3: What happen if guest reboots itself in response to the MCE?
>>
>> That depends...
>>
>> And the following issue also holds for a guest that is rebooted at
>> some point having successfully sidelined the bad page.
>>
>> After the guest has panic'd, a system_reset of the guest or a restart
>> initiated by crash_kexec() (called by panic() on the guest), usually
>> results in the guest hanging because the bad page still belongs
>> to qemu-kvm and is now being referenced by the new guest in some way.
>
> Yes. In other words my concern about reboot is that new guest kernel
> including kdump kernel might try to read the bad page. If there is
> no AR-SIGBUS etc., we need some tricks to inhibit such accesses.

Agreed.

>> (It actually may not hang, but successfully reboot and be runnable,
>> with the bad page lurking in the background. It all seems to depend on
>> where the bad page ends up, and whether it's ever referenced.)
>
> I know some tough guys using their PC with buggy DIMMs :-)
>
>>
>> I believe there was an attempt to deal with this in kvm on the host.
>> See kvm_handle_bad_page(). This function was suppose to result in the
>> sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
>> which in theory would result in the right thing happening. But commit
>> 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
>> sent. So this mechanism needs to be re-worked, and the issue remains.
>
> Definitely.
> I guess Huang has some plan or hint for rework this point.

Yeah, as far as I know Huang is looking into this.

>> I would think that if the the bad page can't be sidelined, such that
>> the newly booting guest can't use it, then the new guest shouldn't be
>> allowed to boot. But perhaps there is some merit in letting it try to
>> boot and see if one gets 'lucky'.
>
> In case of booting a real machine in real world, hardware and firmware
> usually (or often) do self-test before passing control to OS.
> Some platform can boot OS with degraded configuration (for example,
> fewer memory) if it has trouble on its component. Some BIOS may
> stop booting and show messages like "please reseat [component]" on the
> screen. So we could implement/request qemu to have such mechanism.
>
> I can understand the merit you mentioned here, in some degree. But I
> think it is hard to say "unlucky" to customer in business...

I totally agree.

>> I understand that Huang is looking into what should be done. He can
>> give you better information than I in answer to your questions.
>
> Agreed. Thank you very much!

You're welcome.

Dean

> Thanks,
> H.Seto

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The test I'm running is the mce-test suite's kvm test.
A portion of the messages it outputted (to stdout) follows:

> Guest physical address is 0x71220000
> Host virtual address is 7f9dc5020
> Host physical address is 0x1051620000
> Guest physical klog address is 0x71220

And it called mce-inject with the following data file:

> [root@intel-s3e36-02 test]# cat SRAO
> CPU 0 BANK 2
> STATUS UNCORRECTED SRAO 0x17a
> MCGSTATUS MCIP RIPV
> MISC 0x8c
> ADDR 0x1051620000
> [root@intel-s3e36-02 test]#

The following is from the host's /var/log/messages:

> Oct 7 09:42:48 intel-s3e36-02 kernel: Triggering MCE exception on CPU 0
> Oct 7 09:42:48 intel-s3e36-02 kernel: Machine check events logged
> Oct 7 09:42:48 intel-s3e36-02 kernel: MCE exception done on CPU 0
> Oct 7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: Killing qemu-system-x86:6867 early due to hardware memory corruption
> Oct 7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: dirty LRU page recovery: Recovered

Lastly, the following is a screen grab from the guest's serial console:

> HARDWARE ERROR
> CPU 0: Machine Check Exception: 5 Bank 9: bd000000000000c0
> RIP !INEXACT! 33:<0000000000400428>
> TSC 17a67acd14 ADDR 71220000 MISC 8c
> PROCESSOR 0:6d3 TIME 1286458966 SOCKET 0 APIC 0
> No human readable MCE decoding support on this CPU type.
> Run the message through 'mcelog --ascii' to decode.
> This is not a software problem!
> Machine check: Uncorrected
> Kernel panic - not syncing: Fatal machine check on current CPU
> Pid:1493, comm: simple_process Tainted: B M ---------------- 2.6.32.dnelson_test #48
>
> Call Trace:
> <#MC> [] panic+0x78/0x137
> [] mce_panic+0x1e2/0x210
> [] do_machine_check+0x843/0xa70
> [] machine_check+0x1c/0x30
> <>
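
For anyone reading the guest console output above without mcelog at hand, the following is a minimal sketch (not part of the mce-test suite) that decodes the bank status word by hand and relates the byte addresses to the page frame numbers seen in the logs. The flag bit positions are the IA32_MCi_STATUS layout from the Intel SDM; the helper names `decode_status` and `page_frame` are made up for illustration, and the 4 KiB page size is an assumption.

```python
# Bit positions of the IA32_MCi_STATUS flags (Intel SDM layout).
# S and AR are only architecturally defined when MCG_SER_P is set.
MCI_STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60, "MISCV": 59,
    "ADDRV": 58, "PCC": 57, "S": 56, "AR": 55,
}

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def decode_status(status):
    """Return (names of the flags that are set, MCA error code)."""
    flags = [name for name, bit in MCI_STATUS_FLAGS.items()
             if (status >> bit) & 1]
    return flags, status & 0xFFFF  # low 16 bits hold the MCA error code


def page_frame(addr):
    """Convert a byte address to a page frame number."""
    return addr >> PAGE_SHIFT


if __name__ == "__main__":
    # Values copied from the guest console and the mce-test output above.
    flags, code = decode_status(0xBD000000000000C0)
    print("flags:", " ".join(flags))
    print("MCA error code: %#x" % code)
    print("host pfn:  %#x" % page_frame(0x1051620000))
    print("guest pfn: %#x" % page_frame(0x71220000))
```

With the values from this report, the status decodes to VAL, UC, EN, MISCV, ADDRV and S, with PCC and AR clear, and an MCA error code of 0xc0, which looks consistent with the memory scrubbing SRAO event discussed earlier in the thread. The page frame numbers come out as 0x1051620 on the host (the value in the "MCE 0x1051620:" log lines) and 0x71220 in the guest.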