From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from NAM02-BL2-obe.outbound.protection.outlook.com
	(mail-bl2nam02on0041.outbound.protection.outlook.com [104.47.38.41])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 3xnv7Z38jgzDrKL;
	Thu, 7 Sep 2017 18:41:29 +1000 (AEST)
From: Joakim Tjernlund
To: "linuxppc-dev@lists.ozlabs.org", "leoyang.li@nxp.com", "york.sun@nxp.com"
Subject: Re: Machine Check in P2010(e500v2)
Date: Thu, 7 Sep 2017 08:41:20 +0000
Message-ID: <1504773676.31322.2.camel@infinera.com>
References: <1504265576.20777.61.camel@infinera.com>
	<1504600831.27247.20.camel@infinera.com>
	<1504729020.27247.120.camel@infinera.com>
	<1504731221.27247.126.camel@infinera.com>
	<1504738204.27247.133.camel@infinera.com>
In-Reply-To: <1504738204.27247.133.camel@infinera.com>
Content-Type: text/plain; charset="iso-8859-15"
MIME-Version: 1.0
List-Id: Linux on PowerPC Developers Mail List

On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> On Wed, 2017-09-06 at 21:13 +0000, Leo Li wrote:
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@infinera.com]
> > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > To: linuxppc-dev@lists.ozlabs.org; Leo Li; York Sun
> > > Subject: Re: Machine Check in P2010(e500v2)
> > >
> > > On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > > > -----Original Message-----
> > > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@infinera.com]
> > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li; York Sun
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > >
> > > > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: York Sun
> > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > To: Joakim Tjernlund; linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > >
> > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > >
> > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > So after some debugging I found this bug:
> > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > >  	if (is_in_pci_mem_space(addr)) {
> > > > > > > >  		if (user_mode(regs)) {
> > > > > > > >  			pagefault_disable();
> > > > > > > > -			ret = get_user(regs->nip, &inst);
> > > > > > > > +			ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > >  			pagefault_enable();
> > > > > > > >  		} else {
> > > > > > > >  			ret = probe_kernel_address(regs->nip, inst);
> > > > > > > >
> > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > Now I wonder why this fixup is there in the first place? The
> > > > > > > > routine will not really fix up the insn, just return 0xffffffff
> > > > > > > > for the failing read and then advance the process NIP.
> > > > > >
> > > > > > You are right. The code here only gives 0xffffffff to the load
> > > > > > instruction and continues with the next instruction when the load
> > > > > > instruction is causing the machine check. This will prevent a
> > > > > > system lockup when reading from a PCI/RapidIO device whose link
> > > > > > is down.
> > > > > >
> > > > > > I don't know what the actual problem is in your case. Maybe it is
> > > > > > a write instruction instead of a read? Or the code is in an
> > > > > > infinite loop waiting for a valid read result? Are you able to do
> > > > > > some further debugging with the NIP correctly printed?
> > > > > >
> > > > >
> > > > > According to the MC it is a read, and the NIP also leads to a read
> > > > > in the program.
> > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > Question, is it safe to add a small printk when this MC happens
> > > > > (after fixing up)? I need to see that it has happened as the error
> > > > > is somewhat random.
> > > >
> > > > I think it is safe to add printk as the current machine check
> > > > handlers are also using printk.
> > >
> > > I hope so, but if the fixup fires there is no printk at all so I was a
> > > bit unsure.
> > > Don't like this fixup though, is there not a better way than faking a
> > > read to user space (or kernel for that matter)?
> >
> > I don't have a better idea. Without the fixup, the offending load
> > instruction will never finish if there is anything wrong with the
> > backing device and will freeze the whole system. Do you have any
> > suggestion in mind?
> >
>
> But it never finishes the load, it just fakes a load of 0xffffffff. For
> user space I would rather have it signal a SIGBUS, but that does not seem
> to work either, at least not for us, but that could be a bug in the
> general MC code maybe.
> This fixup might be valid for kernel only, as it has never worked for
> user space due to the bug I found.
>
> Where can I read about this erratum?

I have looked high and low and cannot find an erratum that maps to this
fixup. The closest I get is A-005125, which seems to have a different
workaround. I cannot find any evidence that this workaround has been
applied in Linux, can you?

 Jocke
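
For anyone following the thread, the fixup behaviour discussed above (hand
0xffffffff to the failing load, then advance NIP past it) can be modelled in
user space roughly as below. This is only a sketch under stated assumptions,
not the kernel code: struct fake_regs, fixup_load() and fixup_load_selfcheck()
are hypothetical names made up for illustration, and only the plain lwz
opcode is decoded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (illustration only). */
struct fake_regs {
	uint32_t gpr[32];	/* general-purpose registers */
	uint32_t nip;		/* next instruction pointer */
};

#define PPC_OP_LWZ 32u		/* primary opcode of "lwz rD, d(rA)" */

/*
 * Model of the fixup: if inst is a load we recognise, pretend it read
 * all-ones into its destination register and step past it.
 * Returns 1 if the instruction was handled, 0 otherwise.
 */
static int fixup_load(struct fake_regs *regs, uint32_t inst)
{
	uint32_t op = inst >> 26;		/* top 6 bits: primary opcode */
	uint32_t rd = (inst >> 21) & 0x1f;	/* next 5 bits: destination reg */

	if (op != PPC_OP_LWZ)
		return 0;	/* stores and everything else: not handled */

	regs->gpr[rd] = 0xffffffffu;	/* fake result of the failed read */
	regs->nip += 4;			/* skip the faulting instruction */
	return 1;
}

/* Small self-check: "lwz r3,0(r4)" encodes as 0x80640000. */
static void fixup_load_selfcheck(void)
{
	struct fake_regs regs = { .nip = 0x1000 };

	assert(fixup_load(&regs, 0x80640000u) == 1);
	assert(regs.gpr[3] == 0xffffffffu);
	assert(regs.nip == 0x1004);
}
```

Note that the model never retries or completes the access. The program simply
sees all-ones in the destination register and carries on, which is the
behaviour questioned in this thread.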