[Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN


* [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN
@ 2009-03-20  5:02 Ke, Liping
  2009-03-20 23:46 ` Frank van der Linden
  2009-03-20 23:48 ` Frank van der Linden
  0 siblings, 2 replies; 4+ messages in thread
From: Ke, Liping @ 2009-03-20  5:02 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

Hi Keir,

These patches enable MCA in XEN. They are based on AMD's and SUN's MCA work.
We discussed them with AMD and SUN and have refined them since the last posting. We also rebased them onto
SUN's latest improvements. Patches for recovery actions will follow. This series is the basic framework
for Intel MCA.
 
Some implementation notes:
1) When an error happens, if it is fatal (pcc = 1) or cannot be recovered (pcc = 0, but no good recovery method exists),
    we reset the machine immediately to avoid losing the logs in DOM0. Most MCA MSRs are sticky, so after reboot
    the MCA polling mechanism will send a vIRQ to DOM0 for logging.
2) When MCE# happens, all CPUs enter MCA context. The first CPU to read and clear the error MSR bank becomes the
    owner of this MCE#. Locks and synchronization are used to determine the owner and to select the most severe error.
3) For convenience, we select the most offending CPU to do most of the processing and recovery work.
4) When MCE# happens, we do three things:
    a. Send a vIRQ to DOM0 for logging
    b. Send a vMCE# to the impacted guest (currently only injected into an impacted DOM0)
    c. Virtualize the guest vMCE MSRs
5) Some further improvements and additions for newer CPUs might be done later:
    a) Connection with recovery actions (cpu/memory online/offline)
    b) More software-recovery identification in severity_scan
    c) More refinement and testing for HVM, when needed.
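
The reset policy in note 1 can be sketched roughly as follows. This is an illustrative sketch, not the code from the patches: the MCi_STATUS bit positions are the architectural ones, but `enum mce_action`, `classify_bank`, `worst_action`, and the `have_recovery` flag are hypothetical names introduced here.

```c
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bit positions. */
#define MCi_STATUS_VAL (1ULL << 63)  /* bank holds a valid error */
#define MCi_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCi_STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define MCi_STATUS_PCC (1ULL << 57)  /* processor context corrupt */

/* Hypothetical outcome names, ordered from least to most severe. */
enum mce_action {
    MCE_IGNORE,   /* nothing for the MCE# handler to act on */
    MCE_RECOVER,  /* uncorrected, context intact, recovery method known */
    MCE_RESET     /* fatal or unrecoverable: reset, log after reboot */
};

/* Classify one bank's status value according to note 1. */
enum mce_action classify_bank(uint64_t status, int have_recovery)
{
    if (!(status & MCi_STATUS_VAL))
        return MCE_IGNORE;
    if (!(status & MCi_STATUS_UC))   /* corrected errors are left to polling */
        return MCE_IGNORE;
    if (!(status & MCi_STATUS_EN))
        return MCE_IGNORE;
    if (status & MCi_STATUS_PCC)     /* pcc = 1: fatal */
        return MCE_RESET;
    /* pcc = 0: recover only if a recovery method exists */
    return have_recovery ? MCE_RECOVER : MCE_RESET;
}

/* Scan all banks and return the most severe action found. */
enum mce_action worst_action(const uint64_t *status, int nr_banks,
                             int have_recovery)
{
    enum mce_action worst = MCE_IGNORE;
    for (int i = 0; i < nr_banks; i++) {
        enum mce_action a = classify_bank(status[i], have_recovery);
        if (a > worst)
            worst = a;
    }
    return worst;
}
```

Because most MCA MSRs are sticky, choosing MCE_RESET does not discard the error record: the polling path can still pick it up and raise the vIRQ to DOM0 after reboot.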
 
For details of the discussion with AMD and SUN, please refer to this mail thread:
http://lists.xensource.com/archives/html/xen-devel/2009-02/msg00509.html

Patch description:
1. intel_mce_base: Basic MCA enabling support for Intel.
2. vmsr_virtualization: Guest MCE# MSR read/write virtualization support in XEN.
3. interface: XEN/DOM0 interface that lets DOM0 know the recovery details in XEN.
    For details of the interface discussion, please refer to this mail thread:
    http://lists.xensource.com/archives/html/xen-devel/2009-03/msg00322.html
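
As a rough illustration of what patch 2 means by MSR read/write virtualization, the sketch below serves guest accesses from a per-domain shadow copy instead of the hardware registers. It is an assumption-laden toy, not the patch itself: the MSR numbers are architectural, but `struct vmca_msrs` as laid out here, `NR_VBANKS`, `vmce_rdmsr`, and `vmce_wrmsr` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_MCG_CAP    0x179
#define MSR_IA32_MCG_STATUS 0x17a
#define MSR_IA32_MC0_CTL    0x400  /* bank i: CTL/STATUS/ADDR/MISC at 0x400 + 4*i + 0..3 */
#define NR_VBANKS           4      /* hypothetical number of virtual banks */

/* Hypothetical per-domain shadow of the MCA MSRs. */
struct vmca_msrs {
    uint64_t mcg_status;
    uint64_t bank[NR_VBANKS][4];
};

/* Map a writable MSR number to its shadow slot, or NULL if it is not one. */
static uint64_t *vmca_slot(struct vmca_msrs *v, uint32_t msr)
{
    if (msr == MSR_IA32_MCG_STATUS)
        return &v->mcg_status;
    if (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MC0_CTL + 4 * NR_VBANKS) {
        uint32_t off = msr - MSR_IA32_MC0_CTL;
        return &v->bank[off / 4][off % 4];
    }
    return NULL;
}

/* Guest read: returns 0 if handled, -1 if msr is not a virtual MCA MSR. */
int vmce_rdmsr(struct vmca_msrs *v, uint32_t msr, uint32_t *lo, uint32_t *hi)
{
    uint64_t value;
    uint64_t *slot;

    if (msr == MSR_IA32_MCG_CAP) {
        value = NR_VBANKS;           /* bank count lives in bits 7:0 */
    } else if ((slot = vmca_slot(v, msr)) != NULL) {
        value = *slot;
    } else {
        return -1;
    }
    *lo = (uint32_t)value;
    *hi = (uint32_t)(value >> 32);
    return 0;
}

/* Guest write: only updates the shadow; MCG_CAP is treated as read-only. */
int vmce_wrmsr(struct vmca_msrs *v, uint32_t msr, uint32_t lo, uint32_t hi)
{
    uint64_t *slot = vmca_slot(v, msr);

    if (slot == NULL)
        return -1;
    *slot = ((uint64_t)hi << 32) | lo;
    return 0;
}
```

The nonzero return for unknown MSRs mirrors how the callers in traps.c treat a failing intel_mce_wrmsr/intel_mce_rdmsr, but the exact return convention of the real functions is not shown in this thread.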
 
About testing:
We ran some internal tests and the results were fine.

If there are any problems, just let me know.
Thanks a lot for your help!
 
Regards,
Criping


* Re: [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN
  2009-03-20  5:02 [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN Ke, Liping
@ 2009-03-20 23:46 ` Frank van der Linden
  2009-03-20 23:48 ` Frank van der Linden
  1 sibling, 0 replies; 4+ messages in thread
From: Frank van der Linden @ 2009-03-20 23:46 UTC (permalink / raw)
  To: Ke, Liping; +Cc: xen-devel@lists.xensource.com, Keir Fraser

Ke, Liping wrote:

> These patches enable MCA in XEN. They are based on AMD's and SUN's MCA work.
> We discussed them with AMD and SUN and have refined them since the last posting. We also rebased them onto
> SUN's latest improvements. Patches for recovery actions will follow. This series is the basic framework
> for Intel MCA.

I looked the patches over a little more closely, and merged them with my 
-unstable tree. I found a few minor issues:

* some compile issues with printk format strings in DEBUG and 32-bit
builds
* in severity_scan, use mca_rdmsrl and mca_wrmsrl to work correctly for 
simulated errors using injection
* in severity_scan, if the MSR values were injected for debugging 
purposes, don't panic but keep going, since the injected values will be 
lost at reboot, and this is just a simulated #MC anyway, there is no 
danger of losing state

I'll attach a little patch to fix these issues. I haven't tested this 
patch yet, although the compile fixes have been "tested".
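
To make the second and third points concrete, here is a minimal sketch of an interposition-aware MSR read and the matching "don't panic on injected errors" check. It is not the Xen implementation: `intpose_add`, `intpose_find`, `mca_read`, `pcc_is_fatal`, and the `fake_hw_value` stand-in for a real rdmsr are all hypothetical names, and the table is a fixed-size toy.

```c
#include <stddef.h>
#include <stdint.h>

#define MCi_STATUS_PCC (1ULL << 57)
#define MAX_INTPOSE    8             /* hypothetical table size */

/* One injected (interposed) MSR value for a given CPU. */
struct intpose_ent {
    int used;
    unsigned int cpu;
    uint32_t msr;
    uint64_t val;
};

static struct intpose_ent intpose_tab[MAX_INTPOSE];
static uint64_t fake_hw_value;       /* stand-in for the hardware register */

/* Stand-in for a real rdmsr; every MSR reads back fake_hw_value. */
static uint64_t hw_rdmsr(uint32_t msr)
{
    (void)msr;
    return fake_hw_value;
}

/* Record an injected value; returns 0 on success, -1 if the table is full. */
int intpose_add(unsigned int cpu, uint32_t msr, uint64_t val)
{
    for (size_t i = 0; i < MAX_INTPOSE; i++) {
        if (!intpose_tab[i].used) {
            intpose_tab[i] = (struct intpose_ent){ 1, cpu, msr, val };
            return 0;
        }
    }
    return -1;
}

/* Return the injected entry for (cpu, msr), or NULL if there is none. */
const struct intpose_ent *intpose_find(unsigned int cpu, uint32_t msr)
{
    for (size_t i = 0; i < MAX_INTPOSE; i++) {
        const struct intpose_ent *e = &intpose_tab[i];
        if (e->used && e->cpu == cpu && e->msr == msr)
            return e;
    }
    return NULL;
}

/* Prefer an injected value over the hardware register. */
uint64_t mca_read(unsigned int cpu, uint32_t msr)
{
    const struct intpose_ent *e = intpose_find(cpu, msr);
    return e ? e->val : hw_rdmsr(msr);
}

/* PCC is fatal only when the status came from hardware, not from injection. */
int pcc_is_fatal(unsigned int cpu, uint32_t msr)
{
    uint64_t status = mca_read(cpu, msr);
    return (status & MCi_STATUS_PCC) && intpose_find(cpu, msr) == NULL;
}
```

A plain rdmsrl would bypass the table entirely, which is why the handler has to go through the interposing accessor for simulated errors to be visible at all.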

One final question:

> 2) When MCE# happens, all CPUs enter MCA context. The first CPU to read and clear the error MSR bank becomes the
>     owner of this MCE#. Locks and synchronization are used to determine the owner and to select the most severe error.

Is it always true (at least, for Intel CPUs of family 6 and 15) that 
when a #MC happens, *all* CPUs will receive a #MC trap? I couldn't find 
this anywhere in the documentation.

If this is true, I'll change the MCE injection code to simulate #MC on 
all CPUs in the case of an Intel system.
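
For context on why the broadcast question matters: the owner selection in the quoted note only does anything useful if several CPUs really run the handler at once. A simplified sketch of that election, using C11 atomics instead of Xen's own primitives, might look like this. All names here (`mce_try_become_owner`, `mce_report_severity`, `mce_worst`, `mce_round_reset`) are hypothetical, and the real handler elects the owner around reading and clearing the bank MSRs rather than with a bare flag.

```c
#include <stdatomic.h>

/* Set by the first CPU that claims ownership of the current MCE#. */
static atomic_flag mce_owner_taken = ATOMIC_FLAG_INIT;

/* Highest severity reported by any CPU in the current round. */
static atomic_int worst_severity;

/* Returns 1 for exactly one caller per round, 0 for everyone else. */
int mce_try_become_owner(void)
{
    return !atomic_flag_test_and_set(&mce_owner_taken);
}

/* Raise the shared severity to sev if sev is higher (atomic maximum). */
void mce_report_severity(int sev)
{
    int cur = atomic_load(&worst_severity);

    /* On CAS failure cur is refreshed, so the comparison is re-evaluated. */
    while (sev > cur &&
           !atomic_compare_exchange_weak(&worst_severity, &cur, sev))
        ;
}

/* Read the most severe value reported so far. */
int mce_worst(void)
{
    return atomic_load(&worst_severity);
}

/* Called once all CPUs have left the handler, to arm the next round. */
void mce_round_reset(void)
{
    atomic_store(&worst_severity, 0);
    atomic_flag_clear(&mce_owner_taken);
}
```

If only one CPU ever takes the trap, it trivially wins the election, so an injection path that raises the simulated #MC on a single CPU would not exercise this synchronization.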

- Frank


* Re: [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN
  2009-03-20  5:02 [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN Ke, Liping
  2009-03-20 23:46 ` Frank van der Linden
@ 2009-03-20 23:48 ` Frank van der Linden
  2009-03-21  5:13   ` Keir Fraser
  1 sibling, 1 reply; 4+ messages in thread
From: Frank van der Linden @ 2009-03-20 23:48 UTC (permalink / raw)
  To: Ke, Liping; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 71 bytes --]

Forgot to attach the patch with the minor fixes.. here it is.

- Frank

[-- Attachment #2: intel-fixes --]
[-- Type: text/plain, Size: 3924 bytes --]

diff --git a/xen/arch/x86/cpu/mcheck/mce_intel.c b/xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -256,9 +256,10 @@ static int fill_vmsr_data(int cpu, struc
         d->arch.vmca_msrs.nr_injection++;
 
         printk(KERN_DEBUG "MCE: Found error @[CPU%d BANK%d "
-                "status %lx addr %lx domid %d]\n ",
+                "status %p addr %p domid %d]\n ",
                 entry->cpu, mc_bank->mc_bank,
-                mc_bank->mc_status, mc_bank->mc_addr, mc_bank->mc_domid);
+                _p(mc_bank->mc_status), _p(mc_bank->mc_addr),
+                mc_bank->mc_domid);
     }
     return 0;
 }
@@ -426,7 +427,7 @@ static void severity_scan(void)
      * recovered, we need to RESET for avoiding DOM0 LOG missing
      */
     for ( i = 0; i < nr_mce_banks; i++) {
-        rdmsrl(MSR_IA32_MC0_STATUS + 4 * i , status);
+        mca_rdmsrl(MSR_IA32_MC0_STATUS + 4 * i , status);
         if ( !(status & MCi_STATUS_VAL) )
             continue;
         /* MCE handler only handles UC error */
@@ -434,7 +435,12 @@ static void severity_scan(void)
             continue;
         if ( !(status & MCi_STATUS_EN) )
             continue;
-        if (status & MCi_STATUS_PCC)
+        /*
+         * If this was an injected error, keep going, since the
+         * interposed value will be lost at reboot.
+         */
+        if (status & MCi_STATUS_PCC && intpose_lookup(smp_processor_id(),
+          MSR_IA32_MC0_STATUS + 4 * i, NULL) == NULL)
             mc_panic("pcc = 1, cpu unable to continue\n");
     }
 
@@ -519,8 +525,8 @@ static void intel_machine_check(struct c
 
     /* Pick one CPU to clear MCIP */
     if (!test_and_set_bool(mce_process_lock)) {
-        rdmsrl(MSR_IA32_MCG_STATUS, gstatus);
-        wrmsrl(MSR_IA32_MCG_STATUS, gstatus & ~MCG_STATUS_MCIP);
+        mca_rdmsrl(MSR_IA32_MCG_STATUS, gstatus);
+        mca_wrmsrl(MSR_IA32_MCG_STATUS, gstatus & ~MCG_STATUS_MCIP);
 
         if (worst >= 3) {
             printk(KERN_WARNING "worst=3 should have caused RESET\n");
@@ -843,7 +849,7 @@ int intel_mce_wrmsr(u32 msr, u32 lo, u32
                 break;
             }
             d->arch.vmca_msrs.mcg_status = value;
-            printk(KERN_DEBUG "MCE: wrmsr MCG_CTL %lx\n", value);
+            printk(KERN_DEBUG "MCE: wrmsr MCG_CTL %p\n", _p(value));
             break;
         case MSR_IA32_MC0_CTL2:
         case MSR_IA32_MC1_CTL2:
@@ -905,7 +911,7 @@ int intel_mce_wrmsr(u32 msr, u32 lo, u32
                 }
                 printk(KERN_DEBUG "MCE: wmrsr mci_status in vMCE# context\n");
             }
-            printk(KERN_DEBUG "MCE: wrmsr mci_status val:%lx\n", value);
+            printk(KERN_DEBUG "MCE: wrmsr mci_status val:%p\n", _p(value));
             break;
     }
     spin_unlock(&mce_locks);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2215,8 +2215,8 @@ static int emulate_privileged_op(struct 
                 break;
             if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
                 if ( intel_mce_wrmsr(regs->ecx, eax, edx) != 0) {
-                    gdprintk(XENLOG_ERR, "MCE: vMCE MSRS(%lx) Write"
-                        " (%x:%x) Fails! ", regs->ecx, edx, eax);
+                    gdprintk(XENLOG_ERR, "MCE: vMCE MSRS(%p) Write"
+                        " (%x:%x) Fails! ", _p(regs->ecx), edx, eax);
                     goto fail;
                 }
                 break;
@@ -2313,7 +2313,7 @@ static int emulate_privileged_op(struct 
 
             if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
                 if ( intel_mce_rdmsr(regs->ecx, &eax, &edx) != 0)
-                    printk(KERN_ERR "MCE: Not MCE MSRs %lx\n", regs->ecx);
+                    printk(KERN_ERR "MCE: Not MCE MSRs %p\n", _p(regs->ecx));
             }
 
             break;


* Re: [Patch 0/3]RAS(Part II)--Intel MCA enabling in XEN
  2009-03-20 23:48 ` Frank van der Linden
@ 2009-03-21  5:13   ` Keir Fraser
  0 siblings, 0 replies; 4+ messages in thread
From: Keir Fraser @ 2009-03-21  5:13 UTC (permalink / raw)
  To: Frank van der Linden, Ke, Liping; +Cc: xen-devel@lists.xensource.com

I already did some fixing up. Please send a patch with your remaining fixes.
It probably won't get applied until I get back from holiday, however.

 -- Keir

On 20/03/2009 23:48, "Frank van der Linden" <Frank.Vanderlinden@Sun.COM>
wrote:

> Forgot to attach the patch with the minor fixes.. here it is.
> 
> - Frank
