From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=50281 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1P3sPU-0005ir-4f
	for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:29:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dnelson@redhat.com>) id 1P3sJx-0004gD-FB
	for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:23:51 -0400
Received: from mx1.redhat.com ([209.132.183.28]:33338)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dnelson@redhat.com>) id 1P3sJx-0004g5-4d
	for qemu-devel@nongnu.org; Thu, 07 Oct 2010 11:23:49 -0400
Message-ID: <4CADE5FF.8060605@redhat.com>
Date: Thu, 07 Oct 2010 10:23:43 -0500
From: Dean Nelson <dnelson@redhat.com>
MIME-Version: 1.0
References: <20101004185447.891324545@redhat.com>
	<20101004185715.167557459@redhat.com>
	<4CABD7CC.6030909@jp.fujitsu.com>
	<20101006160531.GB4277@amt.cnet> <4CACBB94.10200@redhat.com>
	<4CAD417B.7060808@jp.fujitsu.com>
In-Reply-To: <4CAD417B.7060808@jp.fujitsu.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>, qemu-devel@nongnu.org, kvm@vger.kernel.org, Huang Ying <ying.huang@intel.com>

On 10/06/2010 10:41 PM, Hidetoshi Seto wrote:
> (2010/10/07 3:10), Dean Nelson wrote:
>> On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
>>> On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
>>>> I have some more questions:
>>>>
>>>> (2010/10/05 3:54), Marcelo Tosatti wrote:
>>>>> Index: qemu/target-i386/cpu.h
>>>>> ===================================================================
>>>>> --- qemu.orig/target-i386/cpu.h
>>>>> +++ qemu/target-i386/cpu.h
>>>>> @@ -250,16 +250,32 @@
>>>>>    #define PG_ERROR_RSVD_MASK 0x08
>>>>>    #define PG_ERROR_I_D_MASK  0x10
>>>>>
>>>>> -#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
>>>>> +#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
>>>>> +#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */
>>>>>
>>>>> -#define MCE_CAP_DEF    MCG_CTL_P
>>>>> +#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
>>>>>    #define MCE_BANKS_DEF    10
>>>>>
>>>>
>>>> It seems that current kvm doesn't support SER_P, so injecting an SRAO
>>>> event into the guest means that the guest receives a VAL|UC|!PCC and
>>>> RIPV event from a virtual processor that doesn't have SER_P.
>>>
>>> Dean also noted this. I don't think it was deliberate choice to not
>>> expose SER_P. Huang?
>>
>> In my testing, I found that MCG_SER_P was not being set (and I was
>> running on a Nehalem-EX system). Injecting an MCE resulted in the
>> guest entering panic() from mce_panic(). If crash_kexec() finds a
>> kexec_crash_image, the system ends up rebooting; otherwise, what
>> happens next requires operator intervention.
>
> Good to know.
> What concerns me is that if a memory scrubbing SRAO event is
> injected when !SER_P, a Linux guest with a certain mce tolerant level
> might grade it as "UC" severity and continue running without
> panicking, killing, or poisoning, because of !PCC and RIPV.
>
> Could you provide the panic message of the guest in your test?
> I think it can tell me why the mce handler decided to panic.

Sure, I'll add the info below at the end of this email.


>> When I applied a patch to the guest's kernel which forces mce_ser to be
>> set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
>> that when the memory page was 'owned' by a guest process, the process
>> would be killed (if the page was dirty), and the guest would stay
>> running. The HWPoisoned page would be sidelined and not cause any more
>> issues.
>
> Excellent.
> So while the guest kernel knows which page is poisoned, guest processes
> are kept from touching the page.
>
> ... Therefore rebooting the VM and loading a new kernel will lose the
> information about which page is poisoned.

Correct.


>>>> I think most OSes don't expect to receive an MCE with !PCC
>>>> on a traditional x86 processor without SER_P.
>>>>
>>>> Q1: Is it safe to expect that guests can handle such a !PCC event?
>>
>> This might be best answered by Huang, but as I mentioned above, without
>> MCG_SER_P being set, the result was an orderly system panic on the
>> guest.
>
> Though I'll wait for Huang (I think he is on holiday), I believe that
> a system panic is just one possible option for an AO (Action Optional)
> event, regardless of SER_P.

I think you may be correct, but Huang will know for sure.


>>>> Q2: What is the expected behavior on the guest?
>>
>> I think I answered this above.
>
> Yeah, thanks.
>
>>
>>>> Q3: What happens if the guest reboots itself in response to the MCE?
>>
>> That depends...
>>
>> And the following issue also holds for a guest that is rebooted at
>> some point having successfully sidelined the bad page.
>>
>> After the guest has panic'd, a system_reset of the guest or a restart
>> initiated by crash_kexec() (called by panic() on the guest), usually
>> results in the guest hanging because the bad page still belongs
>> to qemu-kvm and is now being referenced by the new guest in some way.
>
> Yes. In other words, my concern about reboot is that the new guest
> kernel, including a kdump kernel, might try to read the bad page.  If
> there is no AR-SIGBUS etc., we need some tricks to inhibit such accesses.

Agreed.


>> (It actually may not hang, but successfully reboot and be runnable,
>> with the bad page lurking in the background. It all seems to depend on
>> where the bad page ends up, and whether it's ever referenced.)
>
> I know some tough guys using their PC with buggy DIMMs :-)
>
>>
>> I believe there was an attempt to deal with this in kvm on the host.
>> See kvm_handle_bad_page(). This function was supposed to result in the
>> sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
>> which in theory would result in the right thing happening. But commit
>> 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
>> sent. So this mechanism needs to be re-worked, and the issue remains.
>
> Definitely.
> I guess Huang has some plan or hint for reworking this point.

Yeah, as far as I know Huang is looking into this.


>> I would think that if the bad page can't be sidelined, such that
>> the newly booting guest can't use it, then the new guest shouldn't be
>> allowed to boot. But perhaps there is some merit in letting it try to
>> boot and see if one gets 'lucky'.
>
> When booting a real machine in the real world, hardware and firmware
> usually (or often) run a self-test before passing control to the OS.
> Some platforms can boot the OS with a degraded configuration (for
> example, less memory) if a component has trouble.  Some BIOSes may
> stop booting and show messages like "please reseat [component]" on the
> screen.  So we could implement/request such a mechanism in qemu.
>
> I can understand the merit you mentioned here, to some degree. But I
> think it is hard to say "unlucky" to a customer in business...

I totally agree.


>> I understand that Huang is looking into what should be done. He can
>> give you better information than I in answer to your questions.
>
> Agreed. Thank you very much!

You're welcome.

Dean

> Thanks,
> H.Seto


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The test I'm running is the mce-test suite's kvm test. A portion of
the messages it printed to stdout follows:

> Guest physical address is 0x71220000
> Host virtual address is 7f9dc5020
> Host physical address is 0x1051620000
> Guest physical klog address is 0x71220

And it called mce-inject with the following data file:

> [root@intel-s3e36-02 test]# cat SRAO
> CPU 0 BANK 2
> STATUS UNCORRECTED SRAO 0x17a
> MCGSTATUS MCIP RIPV
> MISC 0x8c
> ADDR 0x1051620000
> [root@intel-s3e36-02 test]#

The following is from the host's /var/log/messages:

> Oct  7 09:42:48 intel-s3e36-02 kernel: Triggering MCE exception on CPU 0
> Oct  7 09:42:48 intel-s3e36-02 kernel: Machine check events logged
> Oct  7 09:42:48 intel-s3e36-02 kernel: MCE exception done on CPU 0
> Oct  7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: Killing qemu-system-x86:6867 early due to hardware memory corruption
> Oct  7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: dirty LRU page recovery: Recovered

Lastly, the following is a screen grab from the guest's serial console:

> HARDWARE ERROR
> CPU 0: Machine Check Exception:                5 Bank 9: bd000000000000c0
> RIP !INEXACT! 33:<0000000000400428>
> TSC 17a67acd14 ADDR 71220000 MISC 8c
> PROCESSOR 0:6d3 TIME 1286458966 SOCKET 0 APIC 0
> No human readable MCE decoding support on this CPU type.
> Run the message through 'mcelog --ascii' to decode.
> This is not a software problem!
> Machine check: Uncorrected
> Kernel panic - not syncing: Fatal machine check on current CPU
> Pid:1493, comm: simple_process Tainted: B   M        ----------------  2.6.32.dnelson_test #48
>
> Call Trace:
>  <#MC>  [<ffffffff814c7c8d>] panic+0x78/0x137
>  [<ffffffff81027382>] mce_panic+0x1e2/0x210
>  [<ffffffff81028873>] do_machine_check+0x843/0xa70
>  [<ffffffff814cb0cc>] machine_check+0x1c/0x30
>  <<EOE>>