From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ashwin Pankaj <ashwin.pankaj@lsi.com>
Subject: Xen PANIC in MCE interrupt context : can global
 variable dom0 be NULL ?
Date: Mon, 15 Feb 2010 19:49:53 +0530
Message-ID: <4B795809.5070304@lsi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi,


I am using Xen 3.4.1. Sometimes, when an MCE occurs, Xen panics due to
a page fault with the following stack trace:

http://pastebin.com/f30f67342

After some digging, the probable culprit seems to be smp_cmci_interrupt():

> if (bs.errcnt && mctc != NULL) {
>         if (guest_enabled_event(dom0->vcpu[0], VIRQ_MCA)) {  /* <--- here */
>             mctelem_commit(mctc);
>             printk(KERN_DEBUG "CMCI: send CMCI to DOM0 through virq\n");
>             send_guest_global_virq(dom0, VIRQ_MCA);
>         } else {
>             x86_mcinfo_dump(mctelem_dataptr(mctc));
>             mctelem_dismiss(mctc);
>         }


It looks like dom0 is NULL here (the offset of vcpu[0] within struct
domain is 0x468). Is this possible?
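
To spell out the reasoning behind that guess: if dom0 is NULL, reading
dom0->vcpu[0] loads from address 0 plus the offset of the vcpu member, so
a fault address equal to that offset suggests a NULL base pointer. The
following is only a sketch. struct mock_domain is a hypothetical stand-in
for Xen's struct domain, with padding chosen purely so that vcpu lands at
0x468, and vcpu0_load_address() is an illustrative helper, not a Xen
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vcpu;

/* Hypothetical stand-in for struct domain: the padding is an assumption,
 * sized only so that the vcpu member sits at offset 0x468. */
struct mock_domain {
    char pad[0x468];
    struct vcpu *vcpu[1];
};

/* Compile-time check that the mock layout matches the reported offset. */
static_assert(offsetof(struct mock_domain, vcpu) == 0x468,
              "mock layout should place vcpu at 0x468");

/* Address the CPU would load from for d->vcpu[0], computed arithmetically
 * so that nothing is actually dereferenced. */
static uintptr_t vcpu0_load_address(uintptr_t domain_base)
{
    return domain_base + offsetof(struct mock_domain, vcpu);
}
```

With a base of 0 (a NULL dom0), vcpu0_load_address() yields 0x468, which
is the pattern one would expect to see as the faulting address if the
guess above is right.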

Other functions, such as mce_softirq(), perform a NULL check on dom0 before
accessing its members:
> /* Step2: Send Log to DOM0 through vIRQ */
>         if (dom0 && guest_enabled_event(dom0->vcpu[0], VIRQ_MCA)) {
>             printk(KERN_DEBUG "MCE: send MCE# to DOM0 through virq\n");
>             send_guest_global_virq(dom0, VIRQ_MCA);
>         }
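
For comparison, here is a minimal, self-contained sketch of how the same
guard might look in the CMCI path. It is not a patch against Xen: the
types, mock_guest_enabled_event() and cmci_decide() are simplified
stand-ins invented for illustration, and the only point is the ordering
of the NULL check before the dereference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the Xen types. */
struct vcpu { bool virq_mca_enabled; };
struct domain { struct vcpu *vcpu[1]; };

/* Mirrors the global from the report; assumed here to be possibly NULL. */
static struct domain *dom0;

enum cmci_action { CMCI_SEND_VIRQ, CMCI_DUMP_LOCALLY };

/* Stand-in for guest_enabled_event(); the real function has a different
 * signature. */
static bool mock_guest_enabled_event(const struct vcpu *v)
{
    return v != NULL && v->virq_mca_enabled;
}

/* Decide what to do with collected telemetry. The short-circuit && means
 * dom0->vcpu[0] is evaluated only when dom0 is non-NULL. */
static enum cmci_action cmci_decide(void)
{
    if (dom0 != NULL && mock_guest_enabled_event(dom0->vcpu[0]))
        return CMCI_SEND_VIRQ;
    return CMCI_DUMP_LOCALLY;
}
```

In this sketch a NULL dom0 falls through to the local-dump branch, which
corresponds to the x86_mcinfo_dump() path in the snippet quoted earlier.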

Also note that this system printed the MCE warning message ("(XEN) MCE:
The hardware reports a non fatal, correctable incident occured on CPU 0")
twice before panicking.

So this code worked properly and entered x86_mcinfo_dump() at least twice
before the panic.


- Regards,
Ashwin